Chapter 1 
Overview 


[Last Modified $Id: chipoverview.lyx 25116 2006-09-07 12:56:52Z wsnyder $] 
The SiCortex node chip, ICE9, is the building block for SiCortex dense clusters. Its architecture aims for careful 
balance between processing power, memory bandwidth, fabric latency, and I/O capability. 


1.1 Some History 


Way back in January of 2002, Jud Leonard and Matt Reilly got together to figure out what they might be able 
to do, given they were interested in systems and silicon. So they started talking to people. Lots of different and 
often strange people.¹ 

One of the conversations, with Tom Knight of MIT, turned to high performance technical computing. Tom 
suggested that what the world needed was a “physics engine” — a device that was specifically designed to solve 
N-dimensional equations and systems. The traditional supercomputer² makers had been in decline for some time. 
As a result, there wasn’t a whole lot of interesting development going on in the field. 

Except for clusters. 

After that initial conversation, Jud, Matt, and Bryce Denney did a boatload of research.³ They found that, 
while the old-fashioned supercomputer and vector computer business had all but died, and traditional symmetric 
multiprocessors were overpriced and underwhelming in the technical market, the cluster server market was booming. 
Everywhere they looked, from the Oil Patch (Shell, Exxon/Mobil) to biochemistry, big iron was being replaced by 
clusters of PC boxes connected with Ethernet or some expensive point-to-point interconnect. 

All these machines were being built by the customers. Big iron - like the SGI Origin, the IBM SP2, and the 
various HP/Compaq machines - was just too expensive. At more than $10,000 per SMP processor node, most 
customers were abandoning shared memory systems for networked clusters of workstations or 1U rackmount PCs 
running Linux. The customers had even adopted a common API, called “MPI” (for “Message Passing Interface”) 
as they converted old shared memory codes into message passing applications. But they all ran up against three 
problems. 

First, PC clusters aren’t very efficient. We found that the typical application in our target markets would 
execute about 15 floating point operations (FLOPs) for every access to main memory. So, given a memory access 
time of about 120nS and a floating point execution rate of, say, infinite FLOPs per second, the average execution 
rate is just 125MFLOPs. Think about that: customers pay for a widget that runs at 3GHz and can crank out 6 
billion floating point operations per second, but they only get 2% of that. All the logic that goes into building a 
whizzy fast FPU is wasted on these applications. Worse, the 3GHz processor burns about 100W while it spends 
most of its time waiting on memory.⁴ 
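
The arithmetic behind the 125MFLOPs figure can be checked with a few lines of Python. This is only a sketch: the variable names are ours, and the inputs are the figures quoted above.

```python
# Memory-bound execution rate for the application profile described above.
FLOPS_PER_MEM_ACCESS = 15     # FLOPs executed per main-memory access
MEM_ACCESS_TIME_S = 120e-9    # main-memory access time (120 nS)
PEAK_FLOPS = 6e9              # advertised peak of the 3GHz processor

# Even with an infinitely fast FPU, each group of 15 FLOPs waits 120 nS.
memory_bound_flops = FLOPS_PER_MEM_ACCESS / MEM_ACCESS_TIME_S
fraction_of_peak = memory_bound_flops / PEAK_FLOPS

print(f"{memory_bound_flops / 1e6:.0f} MFLOPs delivered")
print(f"{fraction_of_peak:.1%} of peak")
```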


¹ This is, by no means, an exhaustive history of SiCortex and how we came to be here. The aim here is to outline the thinking and 
exploring that led to the current architecture. To keep things brief, I’ve left out the equally important (and far more interesting) story 
of how we managed to convince four intrepid VC firms to invest in SiCortex. 

² SiCortex defines a “supercomputer” as a high performance machine that costs more to make than the market will pay. We do not 
intend to make a “supercomputer.” 

³ Google rocks. 

⁴ Note that we don’t believe that the PC designers are misguided. The hell-for-leather strategy that pushed clock rates is a reasonable 
thing to do for applications that fit in cache. Unfortunately, few technical apps fit in cache — even a very large cache. Desktop apps, 
however, fit quite nicely. When you consider that $100B is spent on desktop computers every year (compared with $5B or so on 
technical servers) the big chip guys are probably designing the right widget for their target market. 
Second, PC clusters tend to be big. High density clusters might fit 2 Opteron or Xeon processors in a 1U rack 
slot. But dense packaging produces lots of heat in a small volume. The problem is aggravated by the fact that the 
building block is a 1U box. A 1U box is just 1.75” high. It is very hard to jam all the parts of a PC in such a box 
and still have room for airflow. All this conspires to spread a typical 100 node cluster over three or four racks. 

Finally, parallel applications on PC clusters are limited by the long latency for message passing. Customers 
have migrated from shared memory machines to message passing clusters. They developed the MPI specification (it 
grew out of earlier work on PVM and other message passing schemes) and have implemented it on hardware ranging 
from simple ethernet controllers to Infiniband, Quadrics, and Myricom hardware. Ethernet based implementations 
typically impose a cost of 50uS. For about $1,000 per node users could add Infiniband, Quadrics, or Myricom 
hardware that could get that latency down to about 5uS. Our models show that the 500nS latency of the SiCortex 
dense fabric could allow applications to scale to ten or even one hundred times as many processors. 

The world probably didn’t need a physics engine. But it looked like the world might buy a cluster that was 
built to run technical applications. 


1.2 The System 


The SiCortex Dense Cluster is founded upon four piers: 


1. Optimize the balance between raw compute rate (FLOPS or Integer Ops/second) and memory latency and 
bandwidth. 


2. Provide low-latency user-mode to user-mode transactions to support MPI. 
3. Manage power to provide a high ratio of delivered performance per watt. 
4. Aim for an order-of-magnitude advantage in delivered performance per dollar. 


This last point is the raison d’etre⁵ for the SiCortex cluster, so we’d better describe what we mean by “delivered 
performance.” Our model of a technical computation divides the work into three parts: calculation, memory access, 
and communication. So the time to complete a computation is: 


\[ T_c = T_{calc} + T_{mem} + T_{comm} \]


Our survey of the applications in the target markets yielded a large number that, as we said, had a ratio of memory 
accesses to floating point operations of 1:15. We ignore all the other operations, as most processors will find a way 
to execute them in parallel (or nearly so) with the floating point ops, or will execute them in parallel with the main 
memory access. So let’s assume that we have an application that needs to do M floating point ops. Then the time 
to completion is 


\[ T_c = M\,T_{flop} + \frac{M}{15}\,T_{access} + T_{comm} \]


For a modern (say 3GHz P4) processor $T_{flop}$ = 0.15nS and $T_{access}$ = 120nS. Cranking that in to our model: 


\[ T_c = M\left(0.15 + \frac{120}{15}\right) + T_{comm} \approx M\,\frac{120}{15} + T_{comm} = 8M + T_{comm} \]


The time to complete the calculation is irrelevant. Applications in this class are all about moving data, and not 
about doing arithmetic. 

That, of course, still leaves the communications (Tcomm) component. We did a few measurements and found that 
many applications fell into a range where 1,000 to 100,000 FLOPs were executed for every message sent. So, we 
cranked in a few numbers. Typical Ethernet based implementations of MPI will consume about 50uS of processor 
time for every message. Using a rate of say 10K FLOPs per message we get 


\[ T_c \approx 8M + \frac{M}{10000}\,50000 = (8 + 5)M \]


Note that 5/13 of the computation is consumed by communication overhead. In actual practice, as the number 
of processors applied to a problem is increased, the ratio of communication operations to arithmetic operations 
increases. This is one of the key limiters on parallelism in our target market. As the communication rate approaches 


⁵ raison d’etre (pronounced “rayzohn debtr”) French for “reason to be.” Often used when an author wants to sound classy. 
one message for every thousand FLOPs, the communication overhead begins to dominate the solution time. The 
SiCortex solution is to reduce the cost of communication to 500nS per operation. This allows practical scaling to 
many more parallel processes. 
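
To see how the message cost moves the balance, the model can be evaluated per floating point op. The sketch below is ours, not a SiCortex tool: the function names are illustrative, the default parameters are the P4 figures used above, and the three message costs are the Ethernet, PCI fabric card, and SiCortex figures quoted in this chapter. Unlike the approximation in the text, it keeps the small 0.15nS calculation term.

```python
def time_per_flop_ns(t_msg_ns, flops_per_msg,
                     t_flop_ns=0.15, t_access_ns=120.0, flops_per_access=15):
    """Return (calc, mem, comm): nS charged to each floating point op."""
    calc = t_flop_ns
    mem = t_access_ns / flops_per_access
    comm = t_msg_ns / flops_per_msg
    return calc, mem, comm


def comm_fraction(t_msg_ns, flops_per_msg):
    """Fraction of total completion time spent on communication."""
    calc, mem, comm = time_per_flop_ns(t_msg_ns, flops_per_msg)
    return comm / (calc + mem + comm)


# Per-message processor cost in nS, taken from the figures in the text.
MESSAGE_COST_NS = {"Ethernet": 50_000, "PCI fabric card": 5_000, "SiCortex goal": 500}

for flops_per_msg in (10_000, 1_000):
    for name, cost in MESSAGE_COST_NS.items():
        share = comm_fraction(cost, flops_per_msg)
        print(f"{flops_per_msg:>6} FLOPs/msg  {name:<16} {share:6.1%} communication")
```

At one message per thousand FLOPs the Ethernet case spends most of its time communicating, while the 500nS case stays in single digits.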

Half a microsecond per message is a pretty tough goal. The best-in-class PCI-resident fabric widgets from 
Myricom or Quadrics get down to 5uS or so. We thought about that problem for a while. What we noticed is that 
all the previous solutions treated message operations as I/O transactions. They had to: the message widget was 
out on an I/O bus. But most I/O systems, if they are optimized at all, are optimized for bandwidth, not latency. 
Good MPI support means providing low latency for short transfers, and high bandwidth for long transfers. Putting 
an I/O bus (and operating system code, and drivers, and buffer copy operations) between the user’s application 
and the message system puts PC based solutions between a rock and a hard place. 

The SiCortex approach is to elevate message operations above the I/O system. By closely coupling the fabric 
interface to the L2 cache, virtualizing the interface between user mode applications and the fabric, and providing 
very low latency message routing, the SiCortex system can provide a 10x improvement in message latencies over 
previous best-in-class approaches. The rest of this document describes the approach in detail. 


1.3 ICE9 


ICE9 is the central component of a large-scale parallel computer system designed to run technical applications 
— specifically, those which require large amounts of memory and floating point arithmetic — with superior efficiency. 
It will run Linux well, and in particular, provide extraordinary performance to MPI, the message passing interface. 
And we will keep the cost very low. 

We will integrate in a single device most of the electronic components needed for the system — microprocessors, 
caches, memory controllers, fabric switch, DMA engine, and PCI-Express interface. Excluded from the chip are 
main memory (commodity DRAMs), point-of-load power regulators, and the control/management system. 

The fundamental insight behind SiCortex is that faster processor clock speed is no longer an effective way of 
improving time to solution; that in fact, most parallel technical applications spend the bulk of their time waiting 
for memory and/or communication between processors. 


1.3.1 Goals 


Latency We often measure and advertise bandwidth, which sets a strict limit on the throughput available from 
computer systems, but it’s useful to recognize that in most circumstances, latency is the more immediate limitation, 
because it is generally difficult to get enough parallel activity underway to use the full bandwidth unless each action is 
brief. This design focusses on main memory (cache miss) latency, which is the primary determinant of single-stream 
performance in this market, and on MPI communication latency, the time required to get a short message (ping) 
from a user-mode process on any processor to a waiting user-mode process on the most distant other processor. 


• Our goal for the memory latency, measured from a load instruction to use of the data, is 80 ns. 
• Our goal for the one-way communication latency, measured by the MPI Ping-Pong test, is 500 ns. 
• Our goal for memory bandwidth is 6.4 GBytes/sec, as measured by the McCalpin stream tests. 


Power Careful and concerted attention to minimization of power is key to the success of the SiCortex product. 
By using a small, low-power microprocessor at its most efficient operating point, we are able to keep its cost very low 
and spread the computational workload over a much larger number of streams. This results in far better utilization 
of the memory system, which is the bottleneck for delivered performance, but depends on keeping communication 
delays minimized. 


• Our goal for the power dissipation of the chip is 8-10 watts. 


Reliability Large-scale systems are particularly sensitive to reliability concerns, for several reasons. On the one 
hand, the statistical probability of failure is proportional to the number of components, so large systems with many 
components suffer inherently lower reliability than systems with fewer components. On the other hand, people buy 
large systems because they have long-running tasks and strong economic incentives to get them finished quickly, so 
system failures create direct financial consequences for the system owners. 

The SiCortex system employs a number of techniques to maximize the reliability of the system from the user’s 
perspective. 
• Power consciousness: the system is designed to run cool in the worst case under heavy load, and cooler still 
when idle, to keep the reliability high. 


• N+1 redundancy of power and cooling systems: Power distribution, from the mains to the module level, is 
designed with inherent redundancy, so that a failure within the power supply will not cause any interruption of 
service. Similarly, cooling fans are individually replaceable, and provide enough capacity to maintain specified 
thermal limits even with one fan inoperative. 


• Dual redundancy of the control and management system, allowing the system to survive failures in the control 
system without affecting normal operation. 


• Modular, message-passing hardware/software architecture: failures of a compute node, its memory, or the 
fabric switches do not force system failures. The fabric architecture is able to route around faults, and the 
software system is able to restart a checkpointed process on a different processor, so that a failure of one node 
need not terminate the application(s) using that node. 


• Full SEC/DED Error Correcting Code on main memory and L2 cache, to provide fully automatic recovery 
from transient and permanent single-bit errors as well as early warning of deteriorating devices so that they 
can be replaced during scheduled maintenance. 


1.4 Overall Block Diagram 


1.4.1 Processor Cores 


ICE9 contains six Mips 5KF processor cores. Each core implements 32KB of instruction cache, 32KB of data 
cache, and a 256KB “slice” of the shared L2 cache. The L1 data caches, and the shared L2 cache, are coherent. 


1.4.2 L2 Cache 


ICE9 implements a shared 1.5MB L2 cache. The cache is composed of 256KB slices that are local to a core. 
The L2 cache controller implements global coherency across ICE9. 


1.4.3 Memory Controller 


ICE9 implements two DDR2 SDRAM memory controllers. Each controller interfaces to one 72b (ECC) unregistered 
DDR2 DIMM. This provides for 2GB of memory per node at system FCS, with expansion to 4GB and 8GB 
as memory technology improves. 


1.4.4 PCI-Express Controller 


ICE9 implements a PCI-Express controller for I/O. The controller implements 8 PCI-Express lanes, providing 
20Gbps of I/O bandwidth per node. 


1.4.5 Fabric 


ICE9 implements the SiCortex FastFabric, providing three Receive Links and three Transmit Links per node. 


1.4.5.1 DMA Engine 

The DMA Engine interfaces between the L2 cache and the Fabric switch. It is optimized for MPI operations 
and allows user applications to send and receive data without invoking the operating system kernel. 
1.4.5.2 Fabric Switch 


The Fabric switch implements a four-port crosspoint switch among the three fabric links and the DMA engine. 
The switch provides cut-through routing to minimize latency on packets that are destined for another node. The 
switch also implements full flow control and error retry to ensure reliable transmission and reception. 
1.4.5.3 Link Controllers 


Each link controller implements a single FastFabric transmit or receive link. 


1.4.5.4 Link Subsystem 


The “Link Subsystem” refers to the Fabric Switch, plus the 3 Link Receivers, plus the 3 Link Transmitters. 
When a message is “not for me”, it gets sent on to the next ICE9 along the way to its destination, passing 
only through the SiCortex FastFabric without ever entering the DMA Engine in this ICE9. Such a message will 
come in on a Receive Link, pass through the Fabric Switch, and exit on a Transmit Link. When possible, no 
“store & forward” occurs: the Fabric Switch knows immediately which Transmit Link to use from first-FORD 
information. If the Transmit Link is available, the beginning of the message is already on the outgoing wires 
to the next ICE9 before the end of the message has entered this one. 


1.4.6 Clock Generator 


The clock generator provides internal clocks for ICE9. It generates separate clocks for the cores and L2 caches 
(nominally 500MHz, but variable), the PCI-Express controller (always 250MHz), the memory controller (266MHz, 
333MHz, or 400MHz), and the fabric (nominally 200MHz, but variable). 


1.4.7 Miscellaneous 


Other on-chip components include the JTAG controller, the on-chip logic analyzer, and on-chip peripherals such 
as I2C, UART, etc. 


1.5 Latency Calculations 


1.5.1 Links and Wire-Handling Latency 


This section covers the latency incurred by the Link Transmitter and Link Receiver in sending data over a 
differential pair link between two ICE9 ASICs. The wire propagation delay itself is not included here, but will 
be included in the table further below. 


Unit or Action | Latency | Explanation 
Transmit Link unit | 4.2 ns | From flopped-in till first bit out on serial line. See Internode Link chapter. 
9 more bits onto wire | 4.5 ns | Since Transmit Link latency is till first bit out, and Receive Link latency is from when last bit in, we must add this time. 
Receive Link unit | 15.75 ns | From last bit in on serial line till flopped-out to Fabric Switch. See Internode Link chapter. 
receive synchronization | 2.25 ns | Receive Link must synchronize incoming 10-bit characters with the local s-clock. This takes 0 to 4.5 ns, depending on phase. 
TOTAL | 26.7 ns | 
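
The per-hop total can be rechecked by summing the four rows. The sketch below uses our own names; the values are the ones in the table, and the six-hop product is the “Links and Wire-Handling” entry used in the ICE9 to ICE9 table that follows.

```python
# Per-hop link latency, in nS, from the table above.
LINK_LATENCY_NS = {
    "Transmit Link unit": 4.2,
    "9 more bits onto wire": 4.5,
    "Receive Link unit": 15.75,
    "receive synchronization": 2.25,   # midpoint of the 0 to 4.5 ns range
}

BIT_TIME_NS = 0.5                      # one bit at 2 Gbit/S per lane
nine_bits_ns = 9 * BIT_TIME_NS         # remaining bits of a 10-bit character

per_hop_ns = sum(LINK_LATENCY_NS.values())
six_hops_ns = 6 * per_hop_ns

print(f"per hop: {per_hop_ns:.2f} ns, six hops: {six_hops_ns:.1f} ns")
```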
1.5.2 ICE9 to ICE9 Latency 


sending ICE9: software actions | | 
sending ICE9: Processor Hardware | ? | 
sending ICE9: Central Switch | ? | 
sending ICE9: DMA Engine | | 
sending ICE9: Fabric Switch | ? (>15ns) | From when DMA Engine gives transfer to Fabric Switch, till flopped-in by Transmit Link. 
6 hops: “Links and Wire-Handling” | 160.2 ns | 6 times the 26.7 ns from table above 
6 hops: Wire Delay | ? | 6 times the average wire delay ICE9-ICE9 
5 pass-thru ICE9’s: Fabric Switch | 75.0 ns | 5 times minimum Fabric Switch pass-thru latency of 15ns. Defined as from flopped-out by Receive Link till flopped-in by Transmit Link. See Fabric Switch chapter. 
receiving ICE9: Fabric Switch | ? (>15ns) | From when flopped-out by Receive Link, till when given to DMA Engine. 
receiving ICE9: DMA Engine | | 
receiving ICE9: Central Switch | | 
receiving ICE9: Processor Hardware | | 
receiving unit: software actions | ? | 
6-Hop TOTAL | | 
5.5 Hop TOTAL | | From the 6-Hops total, subtract 20.8 ns and 1/2 of one average wire delay. 


1.6 Address Map 


All processor cores in an ICE9 see an identical view of the 36 bit physical address space. The address space 
is split into three major types of sections: cachable memory space, IO space, and PCI-Express spaces. For more 
details, see 16. 
Internode Link 


[Last Modified $Id: link.lyx 51024 2008-02-15 20:37:33Z rwoodscorwin $] 


2.1 Overview 


The SiCortex fabric link (we’ll call it “the link”) is a data link with embedded clock: an eight-bit wide, 
differential, all copper, parallel path with a companion serial flow control path. That is, the link is eight lanes 
of diff pairs, plus one more lane traveling in the opposite direction to carry flow control information. The eight 
parallel lanes carrying data between nodes are called a “Data Link” or “DL”. 

Each lane is implemented as a SERDES channel at raw data rate of 2 Gbit/S per lane, or 2 GByte/S per link. 

We expect the physical design of the link to be a challenge: some links will traverse only a few inches of PCB 
trace, while others may travel through several inches of PCB, a connector, up to 30” of backplane, another connector, 
more backplane, yet another connector, and several more inches of PCB. While daunting, we are encouraged by the 
fact that several switching systems are carrying significantly higher data rates in similar environments, and that 
the technology behind channel compensation, reflection cancellation, and low-loss materials has put the SiCortex 
fabric signalling scheme well within the bounds of current technology. 

In order to maintain DC balance on each lane, and in order to detect data corruption, data traveling on each 
lane is encoded using an 8B/10B code. Each of the 256 possible 8 bit symbols is recoded into a choice of either of 
two 10 bit “characters”. Out of the 1024 possible 10 bit characters, the encoding uses only those with four, five, 
or six “1” bits, that is, those that are balanced or a single bit off balance. When a character has an excess or 
deficit of “1” bits, the code is arranged so that the next unbalanced character leans the other way, so the net 
excess or deficit (over N symbols, for all N) never grows beyond that of a single character. That’s why there are 
two 10-bit encodings available for each 8-bit data symbol. 
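
The balance rule can be illustrated with a short sketch. It is not the encoder itself: the helper names are ours, and the bookkeeping (running disparity held at -1 or +1 at every character boundary, starting at -1) is the usual 8B/10B convention rather than anything stated in this chapter. The two sample characters are the standard encodings of K28.5.

```python
from math import comb

# 10-bit characters with four, five, or six "1" bits. The real code uses
# only a subset of these, because it also limits run lengths.
candidate_characters = sum(comb(10, ones) for ones in (4, 5, 6))


def disparity(char10):
    """Number of 1 bits minus number of 0 bits in a 10-bit character."""
    ones = bin(char10 & 0x3FF).count("1")
    return ones - (10 - ones)


def running_disparity_ok(chars, start=-1):
    """True if running disparity is -1 or +1 after every character."""
    rd = start
    for c in chars:
        rd += disparity(c)
        if rd not in (-1, 1):
            return False
    return True


K28_5_NEG = 0b0011111010   # sent when running disparity is -1 (six 1 bits)
K28_5_POS = 0b1100000101   # sent when running disparity is +1 (four 1 bits)

print(candidate_characters)
print(running_disparity_ok([K28_5_NEG, K28_5_POS, K28_5_NEG]))
print(running_disparity_ok([K28_5_NEG, K28_5_NEG]))
```

Alternating the two encodings keeps the lane balanced; repeating the same unbalanced encoding is what a receiver would flag as a disparity error.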

Using the 8B/10B encoding scheme, the minimum chunk of data that can be sent over the 8-lane link is 64 bits 
wide every 5nS. We call this chunk a “FORD” (for “Fabric wORD”). 
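
The FORD size and timing follow directly from the lane rate and the encoding. The sketch below reproduces them; the names are ours, and the payload rate is simply what the figures above imply rather than a number quoted in this chapter.

```python
LANES = 8               # data lanes per link
BIT_TIME_PS = 500       # one bit at 2 Gbit/S per lane
CHAR_BITS = 10          # bits on the wire per character
SYMBOL_BITS = 8         # data bits carried per character

char_time_ps = CHAR_BITS * BIT_TIME_PS            # time for one character
ford_bits = LANES * SYMBOL_BITS                   # one symbol from each lane
raw_gbytes_per_s = LANES * 2 / 8                  # 8 lanes x 2 Gbit/S, in bytes
payload_gbytes_per_s = (ford_bits / 8) * 1000 / char_time_ps

print(f"FORD: {ford_bits} bits every {char_time_ps / 1000:g} nS")
print(f"raw {raw_gbytes_per_s:g} GByte/S, payload {payload_gbytes_per_s:g} GByte/S")
```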

The 8B/10B code we have chosen allows for a number of valid 10 bit symbols that have no mapping into the 8 
bit space. We use six of these symbols as control and management markers for our link protocols. We use: 

K28.0 for ANULL (alternate NULL) 

K28.1 for SOLS (start of LinkSync) 

K28.2 for EOLS (end of LinkSync) 

K28.3 for SOP (start of packet) 

K28.4 for EOP (end of packet) 

K28.5 for NULL 

You may run into the term “ES_COMMA” which means “NULL or ANULL”. 
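
For reference, the six markers can be collected in a small lookup table. This is a sketch with names of our own choosing. The byte values come from the standard 8B/10B naming convention, in which Kx.y has x in the low five bits and y in the high three bits; the chapter itself gives only the K28.y names.

```python
def k_code(x, y):
    """8-bit value of control character Kx.y (x: low 5 bits, y: high 3 bits)."""
    return (y << 5) | x


# Link markers in the order listed above: K28.0 through K28.5.
LINK_MARKERS = ["ANULL", "SOLS", "EOLS", "SOP", "EOP", "NULL"]
MARKER_VALUE = {name: k_code(28, y) for y, name in enumerate(LINK_MARKERS)}


def is_es_comma(name):
    """ES_COMMA means NULL or ANULL."""
    return name in ("NULL", "ANULL")


for name, value in MARKER_VALUE.items():
    print(f"K28.{LINK_MARKERS.index(name)}  {name:<5}  0x{value:02X}")
```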

We use NULL as an “idle” symbol when the link has no other data to carry, as well as for other purposes. The 
link will carry data in variable length packets. Each packet begins with an SOP (start of packet) character in lane 0, 
and ends with an EOP (end of packet) character in lane. 

To keep interfaces clean and to reduce the amount of byte shuffling that goes on in the fabric part of the chip, 
we'll pass entire FORDs on to the fabric switch logic. The switch datapath is 64bits wide and runs at 1/5 the 
fabric clock, called the “Switch Clock” or “sclk.” 

For each 8 bit parallel link from node A to node B, there is a one bit wide serial channel from node B to node 
A. This link, called the “Control Lane” is used to convey flow-control and buffer status information from a receiving 
node back to the node at the other end of the data link. The control lane uses the same 8B/10B DC balance scheme 
as the data link. As a result, control link tokens are 8 bits wide and arrive every 5nS. 

To communicate between two chips, one chip has an FLT (Fabric Link Transmitter) and the other chip has an 
FLR (Fabric Link Receiver). Between the two chips, eight data lanes go uni-directionally from the FLT to the 
FLR, and one control lane goes uni-directionally from the FLR back to the FLT. Inside each chip the FLT or FLR 
connects to an FSW unit. FSW is described in chapter “The Dense Fabric Switch”. 

Once a Link has been initialized and is sending traffic, three types of Packets are used. Over the 8-lane-wide 
data path are sent Data Packets or Idle Packets. Over the one control lane are sent Control Packets. The formats 
of these packets are described near the beginning of chapter “The Dense Fabric Switch”, in sections The Data Link 
and The Control Link. 

In these packet formats you will see NULL, ANULL, SOP and EOP, which are recognized by this Internode 
Link unit for control and management purposes. All other fields within these three packet types will be constructed 
from the normal 256 8-bit data characters, and are treated as payload by Internode Link and just passed through. 


2.2 Differences, Bugs, and Enhancements 


2.2.1 Product and Chip Pass Differences 
1. NEED IMPL: TWC9A prevents certain noise patterns from causing fabric deadlocks, bug2132. 


2. NEED IMPL: All FL internal counters’ increment signals should be wired into the SCB counters, bug3488. 


2.2.2 Known Bugs and Possible Enhancements 


1. Force retraining should always complete, and software shouldn’t have to detect and implement retries. 


2. The out-of-band path was never used by software, and could be removed for simplicity if desired. 


2.3 Reference Documents 


AnalogBits QPMA cores are used within the Links to directly drive and receive the differential signals. 
AnalogBits documentation is checked-in with svn in directory <project>/specs/ice9/AnalogBits/ 

These are relevant: 

ABIPCCE2_datasheet_20051021v2.pdf “ABIPCCE2 Custom PLL DATASHEET”. 
serdes_PRM._Sicortex_v1_1_2_051130.pdf “Serdes PMA Programmer’s Reference Manual”. 
serdes_test_guidelines_SiCortex_v1_1_1.pdf “Serdes PMA Test Guidelines” 


2.4 SERDES Fabric Links 


The SiCortex fabric link is eight lanes wide in one direction and one lane wide in the other direction. Each lane 
is implemented as a high speed serial channel at the raw data rate of 2 Gbit/S per lane. Each lane will use a SERDES 
transmitter/receiver scheme. 

In a SiCortex chassis, fabric links are used for inter-ICE9 data exchange among all 972 ICE9 nodes. Each ICE9 
connects to six fabric links, three of those via receive ports while the other three are connected to transmit ports. 
Each link is a point to point connection between two nodes, so there is a total of (972 x 6)/2 = 2916 fabric links 
in a chassis. 
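
The link count can be cross-checked two ways, as in the sketch below. The names are ours. The lane total assumes every link has the eight data lanes and one control lane described in the overview.

```python
NODES = 972
RX_PORTS = 3            # receive ports per ICE9
TX_PORTS = 3            # transmit ports per ICE9
LANES_PER_LINK = 8 + 1  # eight data lanes plus one control lane

# Every link has two ends, so count ports and halve.
links = NODES * (RX_PORTS + TX_PORTS) // 2

# Equivalently, every link has exactly one transmitter.
links_by_transmitter = NODES * TX_PORTS

lanes = links * LANES_PER_LINK
print(links, links_by_transmitter, lanes)
```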

Each fabric link is built and operates autonomously. The primary functions of the fabric link subsystem are: 
(a) to acquire lane framing on all lanes; 
(b) to acquire word framing among the eight serial lanes in a data link; 
(c) to acquire synchronization of the link, i.e. to bring the fabric link subsystem to a state in which the fabric 
switches at both ends of the link can use it for data exchange; 
(d) once link synchronization is acquired, to monitor the fabric link for error conditions and to log any error 
that is detected; 
(e) after acquiring link synchronization, to test continuously for loss of link synchronization, and to 
re-synchronize the fabric link when synchronization is lost. 

The fabric link subsystem is built using two basic building blocks designed by a third-party vendor, 
AnalogBits Inc. They are the lane transmitter, which has a SERDES PHY, impedance calibration circuitry, and a PLL, 
and the lane receiver, which has a clock and data recovery circuit. The detailed description of the basic building 
blocks is followed by the description of the Fabric Link Transmitter (TxLink or FLT) and the Fabric Link Receiver 
(RxLink or FLR). 


2.5 8B/10B code 


The 8B/10B code is implemented as per IEEE 802.3-2002 specifications. 


2.6 The Lane Transmitter (Txlane) 


A lane transmitter data channel is shown in Figure 2.1. A 10-bit wide data path begins at LaneEncoder. The 
Txlane latches 10-bit data in the aTxDI[9:0] register, serializes it, and transmits the serialized bit stream on the 
transmitter PHY. The Txlane transmits the LSB (aTxDI[0]) first in time and the MSB (aTxDI[9]) last in time. The 
data transfer rate is the same in both modules: 10 bits every 5nSec, i.e. 10 bits at 200 MHz. 


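[Block diagram of one lane transmitter channel. Data from the switch, or from the loopback and bit-blasting path, 
enters LaneEncoder (clocked by sclk), which drives 10-bit encoded data onto aTxDI[9:0] of Txlane. Txlane contains 
the transmitter, the PLL, and the impedance calibration circuit. Other labeled signals: reset_l, pll_lock, Imp/Cal 
settings, s_refclk (200MHz), s_refclk INV, aRefclkP, txclkP, aTxClk_Stable, NearEndLoopback.] 

Figure 2.1: Transmitter Lane 
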


The Txlane module has a PLL which receives an inverted copy of sclk (200 MHz) as refclk and generates txclkP 
(200 MHz) in a known phase relationship with refclk: a delay of 2-3 bit periods plus the propagation delay on the 
internal quad clock tree. The Txlane module uses txclkP as a strobe timing reference signal to transfer data from 
LaneEncoder. The PLL also asserts a signal, called aTxClk_Stable, indicating when TxClkP is stable and when the 
internal clocks are up and stable. 
LaneEncoder operates in the sclk domain at 200MHz. LaneEncoder supports an 8-bit wide data path from either the 
Fabric Switch or the loopback path within the link interface. It has one buffered data stage in the sclk domain to 
perform 8B10B conversion of data. It generates 10-bit wide encoded data every clock tick (at 200 MHz) for Txlane. 

The 8B10B tables within TX LaneEncoder have 10-bit busses [9,8,7,6,5,4,3,2,1,0] mapped as [a,b,c,d,e,i,f,g,h,j]. 
The TxLane from AnalogBits serializes 10-bit busses [9,8,7,6,5,4,3,2,1,0] such that bit 0 goes first on the serial 
line and bit 9 last. So, to send bits in the correct order as per the IEEE 802.3-2002 spec, LaneEncoder transmits 
the 10-bit bus mapped as [j,h,g,f,i,e,d,c,b,a] to the Txlane. 
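
The remapping amounts to reversing the ten bits. The sketch below shows the idea in software; the function names are ours and it is not the LaneEncoder logic itself. The sample word is the K28.5 character written in abcdeifghj order.

```python
def to_txlane_order(word):
    """Reverse a 10-bit word: table bit 9 ('a') becomes Txlane bit 0."""
    out = 0
    for i in range(10):
        if word & (1 << i):
            out |= 1 << (9 - i)
    return out


def serial_bits(word):
    """Bits in the order Txlane puts them on the wire (bit 0 first)."""
    return [(word >> i) & 1 for i in range(10)]


table_word = 0b0011111010          # bits 9..0 hold a,b,c,d,e,i,f,g,h,j
txlane_word = to_txlane_order(table_word)

print(f"{txlane_word:010b}")
print(serial_bits(txlane_word))    # 'a' first, 'j' last
```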

The data transfer between LaneEncoder and Txlane is synchronous and is described in section 2.6.1. The Txlane 
drives serial data on the transmitter PHY at a data rate of 2 Gbit/S. 

The transmitter impedance calibration circuitry controlling Txlane is described in section 2.19.1. 


2.6.1 Synchronizer setup between sclk and txclkP 


The data transfer between sclk and txclkP is considered a synchronous transfer. It will be achieved by balancing 
clock layout and placement constraints among the synchronizing cells. In each Txlane, there are eleven (11) clock 
endpoints or targets. Of those 11 endpoints, LaneEncoder has 10 endpoints, the clock pins of its flops, and Txlane 
has one endpoint, the aRefclkP input to the PLL. The endpoint at aRefclkP in Txlane will be of inverse polarity to 
the 10 endpoints in LaneEncoder. The design intent is to balance the clock tree from a common source to the 11 
endpoints or targets for each Txlane. There are a total of 27 Txlanes in ICE9 (three transmit links of eight data 
lanes each, plus the control lane of each of the three receive links), hence a total of 27 groups of 11 endpoints 
will be balanced. 

The PLL of Txlane generates txclkP, which has its rising edge within 2 to 3 bit times (i.e. between 1 ns and 
1.5 ns) at the serial data rate, plus propagation delays on the internal clock tree. The design intent is that 
TxLane will latch data on the rising edge of txclkP. 


NOTE : The txclkP clock will not be used by LaneEncoder. For LaneEncoder, txclkP is the implicit 
clock. However, the goal of synchronous transfer between sclk and txclkP is to meet setup and hold 
times wrt txclkP in Txlane. 

The synchronous transfer between sclk and txclkP is achieved by allocating a timing budget for the timing 
components on the clock and data paths. The timing diagram of Figure 2.2 shows the delay components of the clock 
and data paths. 


[Timing diagram, delays in ns. Waveforms: sclk @ 200MHz; s_refclk @ pins of ICE9; sclk (among 10 flops) {0,0.05}; 
encoded_data (Flop-Delay); aTxDI[9:0] (Net-Delay {0.05,0.1}); S_REFCLK (duty cycle variation); s_refclk_INV @ PLL 
{0.02,0.05}; txclk (with PLL jitter) {-0.04,0.04}; txclk (with PLL delay) {1,1.5}; txclkP; Hold {0.21,1.77}.] 

Figure 2.2: synchronizer handshake 



NOTE: The clock distribution network in the chassis will affect the timing budget of ICE9’s internal 
clock, and this may not be represented adequately in the timing budget. The s_refclk is the signal at the pins 
of ICE9. The sclk is the clock signal driving the 10 target flops in the LaneEncoder. The output of the 10 target 
flops stabilizes at the aTxDI[9:0] receiver in Txlane. The data setup and hold checks are to be performed on 
aTxDI[9:0] wrt txclkP. A copy of s_refclk is shown with duty cycle variation as S_REFCLK. The inverted copy, called 
s_refclk_INV, is connected to the aRefClk pin of the PLL in Txlane. The txclk is the clock output of the PLL with the 
PLL jitter spec. The txclkP is the clock output of Txlane wrt which the setup and hold constraints must be met. 

The following spreadsheet specifies the timing budget of each component. Note that the final design implementation 
goal is to have the setup and hold margins equalize at the aTxDI[9:0] cells. 


Row | Type | Name                | Formula                                    | Min   | Max     | Margin  | Comment
1   | V    | sclk_period         | 5                                          | 5     | 5       |         |
2   | V    | r_jitter            | 0.05                                       | 0.05  | 0.05    |         | ICE9 clock input spec. = 50ps
3   | V    | f_jitter            | 0.05                                       | 0.05  | 0.05    |         | ICE9 clock input spec. = 50ps
4   | V    | duty_cycle          | 45                                         | 45    | 45      |         | ICE9 internal spec. = WC 45/55
5   | V    | duty_cycle_min      | [illegible]                                | -0.25 | -0.25   |         |
6   | V    | duty_cycle_max      | (((100-duty_cycle)*sclk_period)/100)-(s... | 0.25  | 0.25    |         |
7   | V    | recal_dcycle        | [duty_cycle_min,duty_cycle_max]            | -0.25 | 0.25    |         |
8   | V    | sclk_mismatch       | [0,0.05]                                   | 0     | 0.05    |         | ICE9 internal spec. - BC, WC = 0-50ps
9   | V    | flop_delay          | [0.05,0.3]                                 | 0.05  | 0.3     |         | ICE9 internal spec. - BC, WC = [50ps, 300ps]
10  | V    | delay_line          | 0                                          | 0     | 0       |         | ICE9 internal spec - delay line (if required), BC, WC = [0, 0]
11  | V    | rc_delay            | [0.05,0.1]                                 | 0.05  | 0.1     |         | ICE9 internal spec - BC, WC = [50ps, 100ps]
12  | V    | net_delay           | delay_line+rc_delay                        | 0.05  | 0.1     |         |
13  | V    | INV_mismatch_in_clk | [0.02,0.05]                                | 0.02  | 0.05    |         | ICE9 internal spec - BC, WC = [20ps, 50ps]
15  | V    | pll_jitter          | [-0.04,0.04]                               | -0.04 | 0.04    |         | AnalogBits PLL spec - WC, short term jitter = 4ps, long term jitter = 40ps
16  | V    | pll_delay           | [1,1.5]                                    | 1     | 1.5     |         | AnalogBits spec. - BC = 2-bit time, WC = 3-bit time
17  | V    | clktree_delay       | 0.5                                        | 0.5   | 0.5     |         | AnalogBits spec. - BC, WC (estimated) = [500ps, 500ps]
18  | V    | AB_Setup            | 2                                          | 2     | 2       |         | AnalogBits spec - Setup constraint of 2ns
19  | V    | AB_Hold             | 0                                          | 0     | 0       |         | AnalogBits spec - Hold constraint of 0ns
20  | C    | Set-Up              | [AB_Setup,]                                | 2     | <1.23,> | <1.23,> | <==== Setup Margin
21  | C    | Hold                | [AB_Hold,]                                 | 0     | <0.21,> | <0.21,> | <==== Hold Margin


Table 2.1: Timing Budget Spec sheet 
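
As a worked check of two derived rows, the following Python sketch recomputes the duty-cycle limits and the net delay from the table inputs. It is illustrative only: the duty-cycle expression assumes that the truncated spreadsheet formula subtracts half of sclk_period (an assumption that reproduces the tabulated -0.25 and 0.25), and the function names are hypothetical rather than taken from the spreadsheet.

```python
def duty_cycle_skew(duty_cycle_pct: float, period_ns: float) -> tuple[float, float]:
    """Return (min, max) edge shift in ns relative to an ideal 50% duty cycle.

    Assumption: the spreadsheet subtracts period/2 from the high time.
    """
    half = period_ns / 2.0
    lo = duty_cycle_pct * period_ns / 100.0 - half
    hi = (100.0 - duty_cycle_pct) * period_ns / 100.0 - half
    return lo, hi


def add_intervals(a: tuple[float, float], b: tuple[float, float]) -> tuple[float, float]:
    """Add two [min, max] delay intervals component-wise."""
    return (a[0] + b[0], a[1] + b[1])


# Rows 4-7: worst-case 45/55 duty cycle on a 5 ns sclk period.
dc_min, dc_max = duty_cycle_skew(45, 5.0)

# Row 12: net_delay = delay_line + rc_delay.
net_delay = add_intervals((0.0, 0.0), (0.05, 0.1))
```

Keeping every budget entry as a [min, max] pair mirrors the bracketed notation in the Formula column.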


2.6.2 Txlane data latency estimates 


Data transfer latency estimates are presented below. Latency is calculated from loading of encoded_data to 
LSB (first) bit on transmitter PHY. 


Path           | From         | To         | Latency
sclk-to-txclkP | encoded_data | aTxDI[9:0] |
               | aTxDI[9:0]   | TXDP/N     | TBD
Total Delay    |              | TXDP/N     |


2.6.3. Txlane module ports (This port list is not complete. Needs portlist Spec from 
AnalogBits) 


Signal Name   | Connects to | Description
aTSIB         |             | Transmit clock signal at 200 MHz. This signal is not used.
aTxClk_Stable | CSR module  | Status signal from PLL indicating the transmit clock is stable.
aTxDI[9:0]    | LinkEncoder | 10-bit data which is 8B10B encoded. Txlane accepts this data at the frequency of sclk (200 MHz).
TXDP/TXDN     |             |


2.6.4 8B10B code Validation Plan 


Verification team will get 8b10b code standard from IEEE 802.3 (ethernet) spec. Verification team will verify 
and validate each Txlane against 802.3 spec, including negative cases of errors. 


2.6.5 Verification Checklist: (This section is not complete) 
1. Verify sclk/txclkP synchronizer settings
2. Verify reset function
3. Verify driving bad disparity on tx
4. Verify driving invalid character on tx
5. Verify NearEndLoopback mode


2.7 The Lane Receiver (Rxlane) 


A lane receiver data channel is shown in Figure-2.3. The Rxlane module of the diagram will be delivered by
AnalogBits, Inc.

The receiver datapath begins in Rxlane at the differential inputs, Rxdp and Rxdn, of the SERDES receiver PHY.
The baud rate at Rxdp/Rxdn is 2 Gbps. The embedded data and clock signals from the PHY are separated by the Rxlane.
The Rxlane de-serializes the incoming data stream and drives databus aRxDO[19:0] and clock aRxClkN in source
synchronous mode to the Framer module. The aRxClkN signal is the clock extracted from the incoming data stream, and it
operates at 200 MHz. The content of aRxDO[19:0] has data fields in the form <current_10bits,previous_10bits>.
The Rxlane places the most recently arrived bit from the PHY in the MSB (aRxDO[19]) and the earliest arrived bit
in the LSB (aRxDO[0]).

The 8B10B tables within the Framer have 10-bit busses [9,8,7,6,5,4,3,2,1,0] mapped as [a,b,c,d,e,i,f,g,h,j].
The RxLane from AnalogBits de-serializes the 10-bit bus as [j,h,g,f,i,e,d,c,b,a], such that bit-a is received first from the
serial line and bit-j last, and drives it in that order to the Framer. So, to receive bits in the correct order as per the
IEEE 802.3 spec, the Framer maps the received 10-bit bus as [a,b,c,d,e,i,f,g,h,j].
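
The bit reordering can be illustrated with a short Python sketch. The function names are hypothetical and this is not the Framer RTL; it only shows that reversing the 10-bit word delivered by Rxlane (earliest bit in bit 0) yields the [a,b,c,d,e,i,f,g,h,j] layout used by the 8B10B tables. The K28.5 code group 0011111010 (written a to j) is used as sample data.

```python
def serial_to_rxlane(bits_in_arrival_order: str) -> int:
    """Pack 10 serial bits the way Rxlane presents them: earliest bit in bit 0."""
    if len(bits_in_arrival_order) != 10:
        raise ValueError("expected exactly 10 bits")
    word = 0
    for i, b in enumerate(bits_in_arrival_order):
        word |= int(b) << i
    return word


def reverse10(word: int) -> int:
    """Reverse the bit order of a 10-bit word (bit 9 <-> bit 0, and so on)."""
    out = 0
    for i in range(10):
        if word & (1 << i):
            out |= 1 << (9 - i)
    return out


# K28.5 (one running-disparity form), transmitted as a,b,c,d,e,i,f,g,h,j.
rxlane_word = serial_to_rxlane("0011111010")   # bus order [j,h,g,f,i,e,d,c,b,a]
framer_word = reverse10(rxlane_word)           # bus order [a,b,c,d,e,i,f,g,h,j]
```

After the reversal, bit 9 of the word holds bit-a, so the value can be compared directly against a table written in a to j order.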

The data transfer rate in both modules, Rxlane and Framer, is equal: 10 bits every 5 ns, i.e. 10 bits
at 200 MHz.


Figure 2.3: Receiver Lane


The Rxlane has a PLL which gets a copy of sclk (200 MHz) on its aRefClkP pin. The SiCortex system uses a single
oscillator to drive the primary clock distribution tree. A copy of the primary clock distribution tree is referenced
by the transmitter to drive the transmitter PHY. Because the origin of both clocks, aRefClkP and the transmitter
clock, is the same oscillator, the difference in frequency between aRefClkP and the recovered clock, aRxClkN, from the
receiver PHY is 0 ppm. It is important to note that the received clock aRxClkN is extracted from the incoming
datastream, and not from aRefClkP, even though it is connected to aRefClkP prior to starting the CDR (clock and data
recovery) function.

The Framer module is described in detail in section-2.7.2.


2.7.1 Clock Alignment and Synchronizer setup between Rxlane and Framer transfer 


The clock alignment between aRxClkN and sclk must take place after both clocks are stable, the
PLLs are locked, and the reset signal is deasserted. All data transfers between Rxlane and Framer are ignored until
the clock alignment step is complete.

The Rxlane and Framer handshake exploits the fact that aRxClkN and aRxDO[19:0] form a source synchronous
transfer. Please note the following four salient points about data transfer between the Rxlane and Framer modules.


1. The Framer logic design will sample the state of the aRxClkN signal to find the alignment between the two clocks,
aRxClkN and sclk, and then adjust aRxClkN for synchronized data transfer between Rxlane and Framer.

2. The Framer logic design will not use the aRxClkN clock to strobe data transfer from aRxDO[19:0].

3. The AnalogBits design team will match electrical delays on the 21 signals, aRxClkN and aRxDO[19:0], from the
internal cells of the IP to the output port of Rxlane.

4. The SiCortex design team will match electrical delays on the 21 signals, aRxClkN and aRxDO[19:0], from the
port of Rxlane to the receiver cells in Framer.


The frequency of sclk and aRxClkN is identical: 200 MHz. However, the phase relationship of aRxClkN
wrt sclk is indeterminate because it depends on the electrical length of
the receiver lane. To align aRxClkN with sclk, Rxlane allows the phase of aRxClkN to be shifted in increments
of 1-bit time. The Rxlane shifts the phase of aRxClkN by stretching the aRxClkN clock by 1-bit time. The clock
stretching is not a glitchless operation; however, sampling of aRxClkN is performed only after the clock
alignment operation is completed.


2.7.1.1 SkipBeat Handshake 


Refer to Section-8.1 of “Serdes PMA Programmer’s Reference Manual” for the details of the SkipBeat handshake. 
The SkipBeat timing parameter table for Sicortex design is shown below: 


Parameter               | Units          | Min                   | Typ                   | Max
(name not recoverable)  | aRxClkN period | 3 (15ns @ 200 MHz)    | 3 (15ns @ 200 MHz)    | 3 (15ns @ 200 MHz)
(name not recoverable)  | aRxClkN period | 31 (155ns @ 200 MHz)  | 31 (155ns @ 200 MHz)  | 31 (155ns @ 200 MHz)


The algorithm for aligning aRxClkN with sclk is described below:


Begin:
First_Search:
    Move phase of aRxClkN by 1-bit time.
    Test logic level of aRxClkN.
    If it is 0 then set flag-First_Search and jump to Second_Search else repeat.
Second_Search:
    Move phase of aRxClkN by 1-bit time.
    Test logic level of aRxClkN.
    If it is 1 then set flag-Second_Search and jump to Final_Search else repeat.
Final_Search:
    Move phase of aRxClkN by 1-bit time.
    Test logic level of aRxClkN.
    If it is 0 then set flag-Final_Search and jump to Adjustment else repeat.
Adjustment:
    Move phase of aRxClkN by 5-bit times, set flag-Adjustment and exit.
End:
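
The search sequence can be exercised with a small behavioural model in Python. This is a sketch under stated assumptions, not the Framer implementation: the phase is counted in bit times (10 per 200 MHz period), and `sampled_level` is a hypothetical stand-in for what sclk observes when it samples aRxClkN at a given phase offset. The point of the model is that every starting phase converges to the same final phase.

```python
BITS_PER_PERIOD = 10            # 10 bit times per 200 MHz clock period
HALF = BITS_PER_PERIOD // 2     # the 5-bit-time final adjustment


def sampled_level(phase: int) -> int:
    """Hypothetical model: aRxClkN reads 1 for the first half period, else 0."""
    return 1 if (phase % BITS_PER_PERIOD) < HALF else 0


def align(phase: int) -> tuple[int, int]:
    """Run the three searches and the adjustment.

    Returns (final phase, number of 1-bit-time shifts performed).
    """
    skips = 0
    # First_Search looks for 0, Second_Search for 1, Final_Search for 0.
    for wanted in (0, 1, 0):
        while True:
            phase = (phase + 1) % BITS_PER_PERIOD   # one SkipBeat operation
            skips += 1
            if sampled_level(phase) == wanted:
                break
    # Adjustment: move a further 5 bit times away from the detected edge.
    phase = (phase + HALF) % BITS_PER_PERIOD
    skips += HALF
    return phase, skips


final_phases = {align(p)[0] for p in range(BITS_PER_PERIOD)}
```

Searching for 0, then 1, then 0 guarantees that the last search stops exactly on a 1-to-0 edge rather than somewhere inside a low phase, which is why the final phase does not depend on the starting point.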
After the receiver clock, aRxClkN, is stable, the worst-case time to complete the SkipBeat handshake at 200 MHz
is calculated as below:

1. The maximum time taken in First_Search = 5 skipbeat operations x 31 sclk periods = 5 x (31 x 5 ns) = 775 ns

2. The time taken in Second_Search = 5 skipbeat operations x 31 sclk periods = 5 x (31 x 5 ns) = 775 ns

3. The time taken in Final_Search = 5 skipbeat operations x 31 sclk periods = 5 x (31 x 5 ns) = 775 ns

4. The time taken for Adjustment = 5 skipbeat operations x 31 sclk periods = 5 x (31 x 5 ns) = 775 ns

5. Total time in skipbeat handshake = 1 + 2 + 3 + 4 = 3100 ns
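
The same arithmetic, written out as a short Python check. The constant names are illustrative; the values (5 operations per stage, 31 sclk periods per operation, 5 ns per sclk period) are the ones used in the list above.

```python
SCLK_PERIOD_NS = 5            # 200 MHz
PERIODS_PER_SKIPBEAT = 31     # sclk periods budgeted per SkipBeat operation
OPS_PER_STAGE = 5             # worst-case SkipBeat operations in each stage
STAGES = ("First_Search", "Second_Search", "Final_Search", "Adjustment")


def stage_time_ns() -> int:
    """Worst-case time spent in one stage of the handshake."""
    return OPS_PER_STAGE * PERIODS_PER_SKIPBEAT * SCLK_PERIOD_NS


def worst_case_ns() -> int:
    """Worst-case time for the complete SkipBeat handshake."""
    return stage_time_ns() * len(STAGES)
```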


2.7.1.2 The RxClk alignment 


The clock alignment and receiver synchronizer timing diagram is shown in Figure-2.4.


Figure 2.4: Clock alignment and Receiver Synchronizer


The timing diagram shows sclk at 200 MHz. After adjusting for the jitter spec of sclk, the CalRxclk signal shows the
metastability region of a flop in TSMC-90G. The timing diagram shows the early arrival of the aRxClkN signal transitioning
from low to high. This will be sampled as going from high to low: it is non-intuitive, but the sampled state of
CalRxclk is observed going from 1 to 0. After accounting for datapath and clock path mismatch,
the databus at RxDO[19:0] has a valid data window equal to the period of sclk minus the data path mismatch
between Rxlane and Framer. The final adjustment of aRxClkN by 5-bit times equalizes the set-up and hold
times at the RxDO[19:0] register.

The timing budget for each timing component in the clock alignment data path is shown in Figure-2.5. The last
column, a comment column, shows the ownership of each line item. The “ICE9 spec” items are owned by
SiCortex while the “AnalogBits spec” items are owned by AnalogBits.


2.7.2 The Framer Module


The Framer module interfaces with Rxlane in the slow clock (sclk at 200 MHz) domain. The block diagram of the
Framer module is shown in Figure-2.3. The two primary tasks of the Framer module are described below.

Row | Type | Name              | Formula        | Min   | Max    | Margin | Comment
1   | V    | sclk_period       | 5              | 5     | 5      |        | ICE9 spec: 200 MHz
2   | V    | sclk_rise_jitter  | [-0.05,0.05]   | -0.05 | 0.05   |        | ICE9 clock input spec. BC, WC = -50ps, 50ps
3   | V    | sclk_fall_jitter  | [-0.05,0.05]   | -0.05 | 0.05   |        | ICE9 clock input spec. BC, WC = -50ps, 50ps
4   | V    | setup_plus_hold   | [-0.4,0.4]     | -0.4  | 0.4    |        | TSMC 90G flop (exaggerated)
5   | V    | early_valid_clk   | -0.5           | -0.5  | -0.5   |        | sclk_rise_jitter(min) + setup_plus_hold(min) - CDR_rise_jitter(max)
6   | V    | CDR_rise_jitter   | [-0.05,0.05]   | -0.05 | 0.05   |        | AnalogBits PLL spec. - BC, WC = -50ps, 50ps
7   | V    | CDR_fall_jitter   | [-0.05,0.05]   | -0.05 | 0.05   |        | AnalogBits PLL spec. - BC, WC = -50ps, 50ps
8   | V    | Rxlane_mismatch   | [0,0.1]        | 0     | 0.1    |        | AnalogBits spec - BC, WC = 0-100ps
9   | V    | PAR_mismatch      | [0,0.1]        | 0     | 0.1    |        | ICE9 spec. - BC, WC = 0-100ps
10  | V    | adjust_bit_time   | 5              | 5     | 5      |        | ICE9/AnalogBits spec. = [illegible] in bit-time
11  | V    | meta_window       | [0,1]          | 0     | 1      |        | ICE9 spec: Metastable window of a flop = 1ns
12  | V    | bilateral_skew    | [-0.5,0.5]     | -0.5  | 0.5    |        | AnalogBits spec in bit-time - BC, WC = -500ps, 500ps
13  | V    | setup_check       | 0.4            | 0.4   | 0.4    |        | TSMC 90G flop setup check
14  | V    | hold_check        | 0.4            | 0.4   | 0.4    |        | TSMC 90G flop hold check
15  | C    | Setup(min margin) | [setup_check,] | 0.4   | <0.8,> | <0.8,> | TSMC 90G, setup constraint check
16  | C    | Hold(min margin)  | [hold_check,]  | 0.4   | <1,>   | <1,>   | TSMC 90G, hold constraint check


Figure 2.5: Clock alignment timing budget 


2.7.2.1. The clock alignment and synchronizer setup 


The clock alignment and synchronizer setup with Rxlane is described in section-2.7.1. 


2.7.2.2 Framing Function and flag-LaneHealth 


The Rxlane receives the serial data stream without any indication of the framing boundary and passes 20 bits of
de-serialized data, aRxDO[19:0], at 200 MHz to the Framer. The content of aRxDO[19:0] has data fields in the form
<current_10bits,previous_10bits>. The Rxlane places the most recently arrived bit from the PHY in the MSB (aRxDO[19])
and the earliest arrived bit in the LSB (aRxDO[0]). The Framer has to find the framing
boundary of the incoming data stream. To aid the framing function, all lane transmitters in ICE9 will drive K28.5 while
the framing function is active.

The data on serial lane is 10-bit encoded data, so there are 10 possible framing boundaries within incoming 
serial data. Only 1 of those 10 framing boundaries is a valid framing boundary. The Framer forms 10 possible 
character strings of incoming data stream. It is assumed that each string is given an identifier, starting from 0 to 
9. 

A framing controller, called framer, has 10-stage counter called rotator. Rotator stages are from 0 to 9. The 
rotator stage is used to select character string identifier. 

Framer will find the framing boundary of the incoming data stream by holding the rotator at a stage for 64 consecutive
clock cycles. Framer will validate the framing boundary if and only if it has received valid K28.5 (NULL) characters
without disparity errors for at least 48 cycles. If the framer has not found the framing boundary, it will increment the
rotator stage and perform the above test again. There are only 10 possible framing boundaries in a free running data
stream; hence the above scheme will find the framing boundary in about (64 x 10) = 640 characters.

The time to send 1 character on the link is 5 ns, so framing will take about (640 x 5 ns) = 3.2 us.
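
A minimal Python sketch of the rotator search follows. It is a behavioural illustration, not the Framer logic: the helper names are hypothetical, the 48-cycle criterion is modelled as a simple count of matching characters within the 64-cycle window, and running-disparity checking is omitted. The two K28.5 code groups are stored in Rxlane bus order (earliest bit, bit-a, in bit 0).

```python
import itertools

WINDOW_CYCLES = 64     # cycles spent on each rotator stage
REQUIRED_GOOD = 48     # K28.5 characters needed to accept a stage

# K28.5 in both disparity forms, written a..j, then packed with bit-a in bit 0.
K28_5_BUS = {int(s[::-1], 2) for s in ("0011111010", "1100000101")}


def bus_samples(bits):
    """Pack a serial bit list (arrival order) into 20-bit aRxDO-style words,
    one per character time: previous 10 bits in [9:0], current in [19:10]."""
    out = []
    for n in range(1, len(bits) // 10):
        chunk = bits[10 * (n - 1): 10 * (n + 1)]
        out.append(sum(b << i for i, b in enumerate(chunk)))
    return out


def candidate(bus20: int, stage: int) -> int:
    """The 10-bit character string selected by a rotator stage (0..9)."""
    return (bus20 >> stage) & 0x3FF


def find_boundary(samples):
    """Try stages 0..9 in turn; return the first stage that sees enough K28.5."""
    it = iter(samples)
    for stage in range(10):
        window = list(itertools.islice(it, WINDOW_CYCLES))
        if len(window) < WINDOW_CYCLES:
            return None            # ran out of data before a decision
        good = sum(candidate(b, stage) in K28_5_BUS for b in window)
        if good >= REQUIRED_GOOD:
            return stage
    return None
```

For example, feeding `bus_samples` a stream of alternating K28.5 code groups from which the first 3 bits have been dropped should make `find_boundary` settle on stage 7, after spending 64 cycles on each of stages 0 through 7.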

When Framer is successful in finding a frame within incoming data stream, it sets the flag-LaneHealth indicating 
that it is receiving error free K28.5 characters from Rxlane and lane’s health is declared “good”. 

After setting flag-LaneHealth, the framer switches its function to checking for loss of framing using a
credit based algorithm. The framer may now receive ANY of the data or control characters. The framer assigns a
health_rating of 0xF to the lane. The health_rating of a lane cannot go higher than 0xF or
lower than 0x0. When the health_rating of a lane reaches 0x0, the lane is non-usable; it
is declared “bad”, and this is indicated by clearing flag-LaneHealth.

The Framer receives a character from a serial lane at the rate of 200 MHz. The Framer evaluates every character
it receives and determines whether it is a credit or a debit. A character without an error is a credit and a character with
an error is a debit. The framer adjusts the lane’s health_rating for every character. If the health_rating of the lane ever
reaches 0x0, the framer determines that the lane has lost framing and that its health status is bad, and clears
flag-LaneHealth. When flag-LaneHealth is cleared, the framer re-enters the framing function.
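
The credit based scheme behaves like a saturating counter, which the Python sketch below illustrates. The class and method names are hypothetical, and the step sizes are an assumption: the text does not say how much a credit or a debit changes health_rating, so the sketch uses 1 for both.

```python
class LaneHealth:
    """Saturating credit/debit counter for one lane (behavioural sketch)."""

    MAX_RATING = 0xF
    CREDIT = 1   # assumed step for an error-free character
    DEBIT = 1    # assumed step for a character with an error

    def __init__(self) -> None:
        # Framing has just been acquired: rating starts at 0xF, flag set.
        self.rating = self.MAX_RATING
        self.flag = True

    def observe(self, char_ok: bool) -> bool:
        """Account for one received character; return flag-LaneHealth."""
        if char_ok:
            self.rating = min(self.MAX_RATING, self.rating + self.CREDIT)
        else:
            self.rating = max(0, self.rating - self.DEBIT)
        if self.rating == 0:
            self.flag = False   # lane declared bad; framing must restart
        return self.flag

    def reframe(self) -> None:
        """Called once the framing function has found the boundary again."""
        self.rating = self.MAX_RATING
        self.flag = True
```

With equal steps, isolated errors are absorbed by the good characters around them, while a sustained run of errors drains the rating to 0x0 and clears the flag.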


2.7.3 The Wordsync function 


The fabric switch transmits and receives 64-bit data (or FORD) to/from the link. Though the transmit link
transmits data on eight transmit lanes wrt sclk, the eight separate physical paths taken from one ICE9 to
another may result in mismatched data propagation delays among the eight lanes. In such a case, the Wordsync module
in the receiver may observe mismatched arrival times on the eight data lanes. The Wordsync function equalizes the electrical
delays among the eight receiver lanes. Figure-2.6 shows implementation details of the Wordsync function.


[Figure 2.6 block diagram. Blocks: Framer, Wordsync. Signals and labels shown: aRxStable, SkipBeat, aRxClkN,
aRxDO[19:0], RxDO, rotator, link_char, decoded_data, decoded_data_d1, decoded_data_d2, decoded_data_d3,
Loopback & BitBlasting, sclk.]


Figure 2.6: Wordsync Function


Wordsyncing among the 8 receiver lanes is achieved in three steps. The first step is to measure the propagation
delay differences among the eight lanes. The next step is to increase the electrical delays of the faster lanes (thus
making them slower). The final step is the validation step, verifying that the total propagation delay of all eight receiver
lanes is equal.

The Wordsync module has provision to delay data byte received from the receiver lane by either 1 or 2 or 3 sclk 
periods. 

To measure the propagation delay difference among the eight receiver lanes, a special character, k28.1, is sent by
the transmit link on all 8 transmitter lanes on the same rising edge of sclk, and the arrival time of k28.1 is noted in
each of the eight Wordsync modules of the receiver link. The lanes receiving k28.1 earlier are faster. The Wordsync
module can measure a propagation delay difference of up to 3 sclk periods among the eight receiver lanes. In the next step,
the Wordsync module increases the data propagation delay of the faster lanes to make it equal on all 8 receiver
lanes. The final step is the verification step. In this step, the transmitter transmits the special character k28.1
again and the receiver lanes validate that all eight of them received k28.1 in the same sclk cycle. Next, the
transmitter transmits all 536 valid 8B10B characters, each character twice, on all 8 lanes. Receiving all (536
x 2) characters without an error on all receiver lanes completes the wordsync function.
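
The delay equalization step can be sketched in a few lines of Python. The function name and the choice to raise an error for excessive skew are illustrative assumptions; the text only states that each lane can be delayed by up to 3 sclk periods and that faster lanes are slowed to match the slowest one.

```python
MAX_DELAY = 3    # sclk periods of delay available in each Wordsync module
NUM_LANES = 8


def wordsync_delays(arrival_cycles):
    """Given the sclk cycle in which each lane saw k28.1, return the extra
    delay (in sclk periods) to add to each lane so that all lanes line up."""
    if len(arrival_cycles) != NUM_LANES:
        raise ValueError("expected one arrival time per lane")
    latest = max(arrival_cycles)
    delays = [latest - t for t in arrival_cycles]
    if max(delays) > MAX_DELAY:
        raise ValueError("lane skew exceeds the 3 sclk periods that can be absorbed")
    return delays


# Example: the lane that saw k28.1 in cycle 13 is the slowest and gets no
# extra delay; the lanes that saw it in cycle 10 are delayed by 3 periods.
example = wordsync_delays([10, 11, 10, 12, 13, 10, 11, 12])
```

The verification pass described above then amounts to checking that a second k28.1 arrives in the same sclk cycle on all eight delayed lanes.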


2.7.4  Rxlane to Framer data latency estimates 


Data transfer latency estimates are presented below. Latency is calculated, in bit times, from the last bit (the MSB of
aRxDO[19:0]) on the receiver PHY to decoded_data in the Framer module.


2.7.5 Rxlane module ports


Signal Name | Direction | Connects to | Description
aRxClkP     |           |             | Reference clock signal for PLL.
aRxStable   |           |             | Status signal indicating that the extracted clock aRxClkN is stable.
SkipBeat    | In        | Framer      | Handshake signal for clock alignment between Framer and Rxlane. When asserted, Rxlane will skip the aRxClkN clock by 1-bit time.
aRxDO[19:0] | Out       | Framer      | Deserialized 20-bit data from receiver PHY. The content of aRxDO[19:0] has data fields in the form <current_10bits,previous_10bits>. The Rxlane places the most recently arrived bit from the PHY in the MSB (aRxDO[19]) and the earliest arrived bit in the LSB (aRxDO[0]). This databus is source synchronous to aRxClkN at 200 MHz.


2.7.6 8B10B code Validation Plan 


Verification team will get 8b10b code standard from IEEE 802.3 (ethernet) spec. Verification team will verify 
and validate each Rxlane against 802.3 spec, including negative cases of errors. 


2.7.7 Verification Checklist: 


1. Verify aRxClkN/sclk synchronizers

2. Verify mis-alignment of aRxClkN among group of 8

3. Verify SkipBeat function

4. Verify SkipBeat offset variable

5. Verify manual operation of clock alignment (or SkipBeat function)

6. Verify force lane-health function

7. Verify enable/disable lane health

8. Verify force Wordsync function

9. Verify Wordsync function through SCB

10. Verify number of data pattern selection in Wordsync function


Figure 2.7: Receiver Link


2.8 The Fabric Link Receiver 


The Fabric Link Receiver (RxLink or FLR) has eight serial receiver lanes for the data packet transfers and one 
serial transmitter lane for the control packet transfers. The RxLink is constructed from eight Rxlane modules, 
one TxLane module, one RxLC module which contains eight Wordsync modules, and one LaneEncoder module as 
shown in Figure-2.7. 

The Fabric switch clock, sclk @ 200 MHz, is distributed to the RxLC and LaneEncoder modules as its primary 
clock. Copy of sclk is also distributed to the Rxlane-7 through 0 on their aRefClkP pins. Inverted copy of sclk is 
distributed to the Txlane-FC on its aRefClkP pin. 

Eight Rxlanes, numbered 7 through 0, receive serial data on the receiver PHY, recover data and clock from
the incoming serial data stream, and drive deserialized data aRxDO[19:0] and clock aRxClkN to eight Wordsync
modules. The eight Wordsync modules, numbered 7 through 0, are within the RxLC module. Each Wordsync module
handshakes with one Rxlane module to set up the synchronizer transfer from the Rxlane and then acquires framing
from the incoming data stream. Then the RxLC module acquires word synchronization among the eight Wordsync
modules. After acquiring word synchronization, the RxLC module decodes received data from the Rxlanes and then
transfers a 64-bit FORD and 3-bit status (SOP, EOP, IDLE) to the fabric switch every sclk cycle.

The control packets travel in the opposite direction. The control packets originate in the fabric switch. From
there, they travel through the LaneEncoder, where they are 8B10B encoded. Encoded data from the LaneEncoder moves
to the aTxDI[9:0] register of the Txlane-FC, which serializes and transmits the data on the transmitter PHY.

The RxLC module has another output port for supporting RxLink bringup routine, FarEndLoopback and 
BitBlasting modes. 

The RxLC module has a controller, called the RxLink Controller (RxLC), which has 2 main functions.


1. Acquire link synchronization by executing the hardware routine called RxLinkSync. Upon successful
completion of RxLinkSync, RxLink has acquired link synchronization.

2. After successful completion of RxLinkSync, RxLC enters the state of MissionMode during which RxLink
is functional and the fabric switch at both ends of the RxLink can exchange data and control packets. In
MissionMode, RxLC will act as a link supervisor and keep checking for link errors including loss of link
synchronization. When it detects that link synchronization is lost, RxLC exits MissionMode and enters the
hardware routine RxLinkSync for re-synchronization of RxLink.


In hardware execution routine RxLinkSync, RxLC controller communicates with hardware execution routine (called 
TxLC and described later in section-2.9) of corresponding 8-lane transmitter of ICE9 using return path through 
LaneEncoder. In hardware execution routine, RxLC is the master and TxLC is the slave. The RxLinkSync routine 
gets executed once after power is up, and after PLLs are locked, and the reset signal is negated. The RxLinkSync 
routine is entered from MissionMode if loss of link synchronization is detected. 

Loss of synchronization, i.e. loss of heartbeat, will occur when any of the following conditions is detected. 
Clearing of flag-LaneHealthStatus due to any of the following cases. 

(a) loss of signal on serial receiver or due to excessive character errors and/or disparity errors on any of the 8 
data lanes, 

(b) setting of flag-ForceRetraining through SCB (see section-2.8.1 for explanation), 

(c) clearing of flag-Heartbeat from heartbeat timeout on data-lane-0 (see section-2.8.1), 

(d) disabling RxLink with SCB RxLcControl Ena bit, SoftReset, or hard reset line. 


2.8.1 Status Flags required by RxLinkSync and RxLC 
RxLC will have following status flags. 


1. Flag-AllRxlanesReset The flag-AllrxlanesReset is set when PLL of all eight RxLanes are locked, and reset 
signal is deasserted in all eight RxLanes in their rxfclk domain, and reset signal in sclk domain is deasserted. This 
flag is reset when any one of eight PLLs of RxLanes has lost lock, or any one of eight RxLanes has reset signal 
asserted, or reset signal in sclk domain is asserted. 


2. Flag-LinkHealth Each RxLane provides status of lane’s health in real time through a flag-LaneHealthStatus. 
A flag-LaneHealthStatus is set if RxLane has acquired frame, and it is receiving valid data and control characters 
from receivers without disparity errors, otherwise flag is reset. Software may also reset flag-LaneHealthStatus 
through SCB by setting ClrLaneHealth. There are 8 flags, flag-LaneHealthStatus. The flag-LinkHealth is created 
from lane health status. The Flag-LinkHealth is set if all 8 flag-LaneHealthStatus are set otherwise it is clear. 


3. Flag-ForceRetraining The hardware execution routine RxLinkSync may be initiated through SCB by
setting flag-ForceRetraining. The transition from 0 to 1 of flag-ForceRetraining causes RxLinkSync to be initiated.
Software should then clear flag-ForceRetraining so it is available for future use.


4. Flag-Heartbeat During MissionMode the RxLink uses a “heartbeat” method of detecting good communication
from the TxLink in the other chip. The following steps describe heartbeat operation:


• When the RxLink achieves MissionMode, flag-Heartbeat is set.

• The fabric switch drives data packets using 8 lanes. The data packets are bounded by SOP (start of packet
char, k28.3) and EOP (end of packet char, k28.4) characters. The transition from SOP and EOP characters
to non-SOP and non-EOP characters detects the heartbeat.

• When the fabric switch is idle, it drives IDLE packets. The transmitter link will drive IDLE packets on lane-0
using NULL (k28.5) or AIDLE (alternate idle char k28.0) characters on the link. During idle cycles, a transition
from an AIDLE character to a non-AIDLE character detects the heartbeat.

• During MissionMode, if a heartbeat is not detected for 128 consecutive clock cycles by either of the above
methods, it is assumed that the link has lost its heartbeat and flag-Heartbeat is cleared; otherwise it remains
set. A loss of Heartbeat causes a loss of MissionMode, and routine RxLinkSync is re-entered.

• Also, if MissionMode is lost for any other reason, flag-Heartbeat will be cleared.
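
The timeout rule can be modelled as a simple watchdog counter, sketched below in Python. The class name and the `beat_seen` input are hypothetical: `beat_seen` stands for "a heartbeat transition was detected in this sclk cycle" by either of the two methods described above. The sketch covers only the 128-cycle timeout, not the detection of the transitions themselves.

```python
HEARTBEAT_TIMEOUT = 128   # consecutive sclk cycles without a heartbeat


class HeartbeatMonitor:
    """Watchdog for flag-Heartbeat during MissionMode (behavioural sketch)."""

    def __init__(self) -> None:
        # Entering MissionMode sets the flag and restarts the count.
        self.flag = True
        self.quiet_cycles = 0

    def tick(self, beat_seen: bool) -> bool:
        """Advance one sclk cycle; return the current flag-Heartbeat."""
        if beat_seen:
            self.quiet_cycles = 0
        else:
            self.quiet_cycles += 1
            if self.quiet_cycles >= HEARTBEAT_TIMEOUT:
                self.flag = False   # link is assumed to have lost heartbeat
        return self.flag
```

Once the flag is cleared it stays cleared in this model; a fresh monitor would be created when MissionMode is re-entered after resynchronization.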


5. Flag-RxLinkSync The flag-RxLinkSync is a status flag. It is controlled by Wordsync module. This flag
is set when link is executing hardware routine RxLinkSync, otherwise this flag is clear. This flag will remain
set if hardware routine RxLinkSync has encountered a failure and/or hardware routine RxLinkSync has not been
completed successfully.


6. Flag-MissionMode The flag-MissionMode is a status flag. It is controlled by Wordsync module. This flag is 
set when hardware execution routine RxLinkSync has been successfully completed i.e. routine has been successful 
in acquiring lane framing, and word framing. When flag-MissionMode is set, it indicates that RxLink is operational 
and control of a link has been transferred to the fabric switch. Setting of flag-MissionMode implies that (i) fabric 
switch at both ends will maintain link heartbeat on data transfer in both direction, (ii) spurious lane errors will be 
detected by lane controllers as bit errors, and those errors are logged, (iii) spurious bit errors will not make link 
unusable, and (iv) persistent bit errors on one or more lanes will cause loss of link health by resetting one or more 
flag-LaneHealth(s), which in turn, will force re-entry of the hardware execution routine RxLinkSync by resetting 
flag-MissionMode and setting flag-RxLinkSync. 


2.8.2 RxLinkSync Routine 
Jump to Begin: 


• BEGIN:
If (flag-ForceRetraining) then jump to Step-1


• Step-1:
Set flag-RxLinkSync, reset flag-MissionMode, reset flag-Heartbeat.
Force Idle on FORD-to-FabricSwitch, disable data path from ControlPacket-to-link, force k_28.5 on TxLane
(send NULL)
(sending k_28.5 without Heartbeat packet will force TxLC to jump to TxLinkSync Routine)
Wait till flag-LinkHealth is set, then jump to Step-2.
(when flag-LinkHealth is set, then all lanes are receiving k_28.5)
Note: If flag-LinkHealth is asserted for less than 3 ticks, the controller will not jump to Step-2 and will
re-enter or remain in Step-1. For each occurrence of such a case, or each jump to Step-2, R_FlrxRxLcCount will
be incremented.


• Step-2:
If (~flag-LinkHealth) then jump to Step-1
else
Force k_28.5 on TxLane (send NULL)
Wait for time T1 = (R_FlrxRxLcControl.Step2WaitTime number of sclks), where
T1(min) = (Rate of Heartbeat + 4 times maximum link delay) = (100 x sclk period) + (4 x 10 ns) = 500 ns +
40 ns = 540 ns
(sending k_28.5 without Heartbeat packet will force TxLC to jump to TxLinkSync Routine)
Jump to Step-3


• Step-3:
If (~flag-LinkHealth) then jump to Step-1
else
Force pattern of k_28.5 and k_28.0 (send alternate NULL characters at the rate of once every 256 sclk cycles)
Wait to receive pattern of k_28.5 and k_28.0 (wait for TxLinkSync routine to respond)
Jump to Step-4


• Step-4:
If (~flag-LinkHealth) then jump to Step-1
else
(Note: This is the Wordsync Routine. Refer to Section-2.7.3 for details of the operation.)
Send and wait for first request to return SOLS char (delay calibration cycle request to return SOLS character
k28.1)
Send and wait for second request to return SOLS char (word alignment cycle request to return SOLS character
k28.1)
Send and wait for valid verification data patterns (512) and control chars (24), EOLS (k28.2) being the last
one
After that, send NULLs (k28.5), while still in Step-4.
if (Wordsync error, or data verification error) then
Set error bits and stay in Step-4 until ~flag-LinkHealth.
else
Wait for EOLS to come back from TxLink in other chip (with no time limit) and then
Set flag-Heartbeat, and jump to END


• END: (enter MissionMode operation)
Set flag-MissionMode, reset flag-RxLinkSync.
Enable data packet path from RxLink-to-FabricSwitch.
Enable control packet path from FabricSwitch-to-RxLink.
Become RxLink supervisor, watching for Heartbeat and bit errors.
Log bit error(s) and disparity error(s) observed on RxLink.
if (~flag-LaneHealth) OR (flag-ForceRetraining goes 0-to-1) OR (~flag-Heartbeat)
Jump to Step-1
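
The control flow of the routine can be summarized as a small state machine, sketched below in Python. This is an interpretation, not the RxLC design: it assumes that the guards at the top of Step-2 through Step-4 all mean loss of link health (~flag-LinkHealth, as written in Step-4), it omits the 3-tick qualification of flag-LinkHealth in Step-1, and the input names (`wait_done`, `pattern_seen`, `wordsync_ok`, `eols_seen`) are hypothetical summaries of the waits described in each step.

```python
from enum import Enum, auto


class Step(Enum):
    STEP1 = auto()
    STEP2 = auto()
    STEP3 = auto()
    STEP4 = auto()
    MISSION = auto()   # the END state: MissionMode operation


def next_step(step, link_health, *, wait_done=False, pattern_seen=False,
              wordsync_ok=False, eols_seen=False,
              force_retrain=False, heartbeat=True):
    """Return the next state of the RxLinkSync routine for one evaluation."""
    if step is Step.STEP1:
        # Send NULLs until all lanes report good health.
        return Step.STEP2 if link_health else Step.STEP1
    if step in (Step.STEP2, Step.STEP3, Step.STEP4) and not link_health:
        return Step.STEP1              # health lost: start over
    if step is Step.STEP2:
        return Step.STEP3 if wait_done else Step.STEP2
    if step is Step.STEP3:
        return Step.STEP4 if pattern_seen else Step.STEP3
    if step is Step.STEP4:
        # On a Wordsync or verification error, stay here until health drops.
        return Step.MISSION if (wordsync_ok and eols_seen) else Step.STEP4
    # MissionMode: supervise the link.
    if not link_health or force_retrain or not heartbeat:
        return Step.STEP1
    return Step.MISSION
```

In this sketch `force_retrain` represents the 0-to-1 transition of flag-ForceRetraining rather than its level, so the caller would pass True for a single evaluation only.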


2.8.3 Verification Checklist: 


1. mis-alignment of rxfclk among group of 8 
2. verify framing and loss of framing conditions 
3. verify Lane Health Status algorithm 
4. verify manual clearing of flag-LaneHealth 
5. verify bit errors - invalid characters and disparity errors 
6. verify user programmable time delay 
7. Set/clear flag-LaneHealthStatus during RxlinkSync from primary input. 
8. Asynchronous events flag-ForceRetraining and flag-Heartbeat 
9. Set/clear flag-RxLinkSync, flag-MissionMode 
10. Enable/disable Rxlc 
11. Verify FarEndLoopback mode 


12. a. Verify bit-blasting mode 
b. Inject disparity error and invalid character error during bit-blasting mode 


2.9 The Fabric Link Transmitter 


The Fabric Link Transmitter (TxLink or FLT) design has eight serial transmitter lanes for data packet transfers 
and one serial receiver lane for control packet transfers. The Txlink is constructed from one LinkEncoder which is 
comprised of eight LaneEncoders, eight TxLanes, one Rxlane, and one TxLC module, as shown in Figure-2.8. 

Data packets originate at the fabric switch, which sends a 64-bit wide FORD to the LinkEncoder every sclk. The FORD
is segmented into eight lanes, each lane carrying a byte. The lanes are identified from 7 through 0. The LinkEncoder
has eight LaneEncoders which are identified as LaneEncoder 7 through 0. The LinkEncoder segments a FORD into
eight lanes and transfers a lane to each LaneEncoder. The LaneEncoder performs 8B10B encoding and transfers
encoded_data[9:0] to TxLane. There are eight Txlane modules and they are identified from 7 through 0. The
TxLane serializes data from the LaneEncoder and transmits it on the SERDES PHY. Thus the data path originates at the
fabric switch in byte-x, then passes through LaneEncoder-x and TxLane-x, and ends at the serial transmitter PHY.

Correspondingly, 8B10B encoded control packets arrive on the serial receiver PHY of Rxlane-FC. The Rxlane-FC
de-serializes the data and transfers aRxDO[19:0] to the Wordsync module, which is part of the TxLC module. The
Wordsync module decodes the 8B10B coding and transfers a byte of data every sclk cycle to the fabric switch. TxLC
has another output port for supporting loopback operations.


Figure 2.8: Transmitter Link

The fabric switch clock, sclk at 200 MHz, is distributed to the TxLC and LinkEncoder modules as the primary clock. A true copy of sclk is distributed to Rxlane-FC on its aRefClkP pin. An inverse copy of sclk is distributed to the TxLane modules on their aRefClkP pins as a reference clock. 

TxLC has a controller with two responsibilities. (i) It executes the hardware routine called TxLinkSync; upon successful execution of TxLinkSync, TxLink has acquired link synchronization. (ii) Upon successful completion of TxLinkSync, TxLC enters the MissionMode state, during which TxLink is functional and the switch fabric at both ends of the link can exchange packets. In MissionMode, TxLC acts as a link supervisor and keeps checking for link errors, including loss of link synchronization. When it detects that link synchronization is lost, TxLC exits MissionMode and re-enters the TxLinkSync routine to re-synchronize TxLink. TxLC has another output port for supporting the TxLink loopback path. 

In the TxLinkSync routine, the TxLC controller communicates with the hardware routine of the corresponding 8-lane receiver (called RxLC and described earlier in section-2.8), using the return path through LinkEncoder. In this exchange the TxLC controller acts as a slave and the RxLC controller acts as a master. TxLinkSync is executed once after power-up, after the PLLs are locked and the reset signal is negated. The TxLinkSync routine is also entered from MissionMode if loss of synchronization, i.e. loss of heartbeat, is detected. 

Loss of TxLink synchronization will occur when any of the following conditions is detected. 

(a) loss of signal on serial receiver or excessive character or disparity errors on fc-lane, 

(b) setting of flag-ForceRetraining through SCB (see section-2.9.1), 

(c) clearing of flag-Heartbeat from heartbeat timeout on the fc-lane (see section-2.9.1), 

(d) disabling TxLink with SCB TxLcControl Ena bit, SoftReset, or hard reset line. 


2.9.1 Status Flags required by TxLC 


TxLC has the following status flags. 

1. Flag-LaneHealthStatus RxLane-FC provides lane status through flag-LaneHealthStatus. The flag-LaneHealthStatus is set if RxLane-FC is receiving valid 8B10B encoded characters from the lane, incoming characters are clear of disparity errors, and the lane has acquired frame; otherwise the flag is reset. Software may also reset flag-LaneHealthStatus through SCB by setting ClrLaneHealth. 


2. Flag-ForceRetraining The TxLinkSync routine may be initiated by software setting flag-ForceRetraining on SCB. The transition from 0 to 1 of flag-ForceRetraining causes TxLinkSync to be initiated. Software should then clear flag-ForceRetraining so it is available for future use. 


3. Flag-Heartbeat During MissionMode the TxLink uses a “heartbeat” method of detecting good communication from the RxLink in the other chip. The following steps describe heartbeat operation: 


• When the TxLink achieves MissionMode, flag-Heartbeat is set. 


• During MissionMode, the RxLink in the other chip drives continuous control packets, which use the SOP (start of packet char, k28.3) character as a marker. A transition from an SOP character to a non-SOP character is detected as a heartbeat. 


• During MissionMode, if no heartbeat is detected for 128 consecutive clock cycles, the link is assumed to have lost heartbeat and flag-Heartbeat is cleared; otherwise it remains set. 


• A loss of Heartbeat causes a loss of MissionMode, and routine TxLinkSync is re-entered. 


• Also, if MissionMode is lost for any other reason, flag-Heartbeat will be cleared. 
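
The heartbeat rule can be sketched as a small cycle-by-cycle behavioral model. This is an illustrative sketch only, not the RTL: the class name HeartbeatMonitor, the tick method and the per-cycle is_sop input are assumptions made for the example. Only the SOP to non-SOP transition rule and the 128-cycle timeout come from the text above.

```python
class HeartbeatMonitor:
    """Behavioral model of flag-Heartbeat during MissionMode (illustrative only)."""

    TIMEOUT_CYCLES = 128  # from the text: 128 consecutive cycles without a heartbeat

    def __init__(self):
        self.flag_heartbeat = True   # set when the TxLink achieves MissionMode
        self.prev_was_sop = False    # was the previous control character SOP (k28.3)?
        self.cycles_since_beat = 0   # cycles elapsed since the last detected heartbeat

    def tick(self, is_sop):
        """Advance one clock cycle; is_sop tells whether the received character is SOP."""
        beat = self.prev_was_sop and not is_sop   # SOP -> non-SOP transition
        self.prev_was_sop = is_sop
        if beat:
            self.cycles_since_beat = 0
        else:
            self.cycles_since_beat += 1
            if self.cycles_since_beat >= self.TIMEOUT_CYCLES:
                # Loss of heartbeat: in hardware this also drops MissionMode
                # and re-enters TxLinkSync. The flag stays cleared afterwards.
                self.flag_heartbeat = False
        return self.flag_heartbeat
```

In this model the flag stays cleared once the timeout fires, which mirrors the statement that a lost heartbeat forces a new pass through TxLinkSync before the flag can be set again.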


4. Flag-TxLinkSync The TxLC controller maintains a status flag-TxLinkSync. When flag-TxLinkSync is set, it indicates that TxLC is in the hardware execution routine TxLinkSync; otherwise this flag is reset. The flag remains set if TxLinkSync has encountered a failure and/or has not completed successfully. 


5. Flag-MissionMode The flag-MissionMode is a status flag controlled by the TxLC module. This flag is set when the hardware execution routine TxLinkSync has been successfully completed. When flag-MissionMode is set, it indicates that TxLink is operational and control of the link has been transferred to the fabric switch. Setting of flag-MissionMode implies that (i) the fabric switch at both ends will maintain link heartbeat on data transfer in both directions, (ii) spurious lane errors will be detected by the FC lane controller as bit errors, and those errors are logged, (iii) spurious bit errors will not make the link unusable, and (iv) persistent bit errors on the FC lane will cause loss of link health by resetting flag-LaneHealth, which in turn will force re-entry of the hardware execution routine TxLinkSync by resetting flag-MissionMode and setting flag-TxLinkSync. 


2.9.2 TxLinkSync Routine 
Jump to Begin: 


• Begin: 
If (flag-ForceRetraining) then jump to Step-1 


• Step-1: 
Set flag-TxLinkSync, reset flag-MissionMode, reset flag-Heartbeat. 
Force Idle on Control Packet-to-Switch, Disable data path from FORD-to-TxLink, Force k_28.5 on all 8 LaneEncoders (send NULL) 
(sending k_28.5 without Heartbeat will force receiver to jump to RxLinkSync routine) 
Wait till flag-LaneHealthStatus is set, then jump to Step-2 
Note: If flag-LinkHealth is asserted for less than 3-ticks, then controller will not jump to Step-2 and will re-enter or remain in Step-1. For each occurrence of such a case, or each jump to Step-2, R_FltxTxLcCount will be incremented. 


• Step-2: 
(when flag-LaneHealth is set, then control lane is receiving k_28.5) 
If (~flag-LaneHealth) then jump to Step-1 
else 
Force Idle on Control Packet-to-Switch, Disable data path from FORD-to-TxLink, Force k_28.5 on all 8 LaneEncoders (send NULL) 
(sending k_28.5 without Heartbeat will force receiver to jump to RxLinkSync routine) 
Wait for time T1 = (R_FltxTxLcControl.Step2WaitTime number of sclks), where 
T1(min) = (Rate of Heartbeat + 4 times maximum link delays) = (100 * sclk period) + 4 * 10nS = 500 + 40 = 540nsec 
Jump to Step-3 


• Step-3: 
If (~flag-LaneHealth) then jump to Step-1 
else 
Force Idle on Control Packet-to-Switch, Disable data path from FORD-to-TxLink, Enable FarEndloopback path 
Jump to Step-4 


• Step-4: 
If (~flag-LaneHealth) then jump to Step-1 
else 
If EOLS (k28.2) then Disable FarEndloopback path, set flag-Heartbeat, and jump to END 
else 
Remain in Step-4, continue FarEndloopback, keep watching for EOLS or ~flag-LaneHealth (no time limit). 


• END: (enter MissionMode operation) 
Set flag-MissionMode, Reset flag-TxLinkSync 
Enable data path from FabricSwitch-to-TxLink. 
Enable control packet path from TxLink-to-FabricSwitch 
Become TxLink supervisor, watching for Heartbeat and bit errors. 
Log bit error(s) and disparity error(s) observed on TxLink 
if (~flag-LaneHealth) OR (flag-ForceRetraining goes 0-to-1) OR (~flag-Heartbeat) 
Jump to Step-1 
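
The control flow of the routine can be sketched as a tick-driven state machine. This is a simplified behavioral sketch under stated assumptions, not the hardware implementation: the class TxLc, its tick method and its boolean inputs are hypothetical names, step2_wait stands in for R_FltxTxLcControl.Step2WaitTime, the "3 ticks" note is interpreted as three consecutive ticks of lane health, and the data-path forcing and R_FltxTxLcCount counting are left out.

```python
from dataclasses import dataclass

STEP1, STEP2, STEP3, STEP4, MISSION = "Step-1", "Step-2", "Step-3", "Step-4", "MissionMode"


@dataclass
class TxLc:
    step2_wait: int          # stand-in for R_FltxTxLcControl.Step2WaitTime (in sclks)
    state: str = STEP1
    health_ticks: int = 0    # consecutive ticks with lane health asserted in Step-1
    wait_count: int = 0      # sclks spent waiting in Step-2

    @property
    def flag_mission_mode(self):
        return self.state == MISSION

    @property
    def flag_tx_link_sync(self):
        return self.state != MISSION

    def tick(self, lane_health, eols=False, force_rt_rise=False, heartbeat=True):
        """Advance one sclk and return the new state."""
        if self.state == STEP1:
            # Assumed reading of the note: health must hold for 3 ticks in a row.
            self.health_ticks = self.health_ticks + 1 if lane_health else 0
            if self.health_ticks >= 3:
                self.state, self.wait_count = STEP2, 0
        elif not lane_health:
            self._restart()                      # ~flag-LaneHealth: back to Step-1
        elif self.state == STEP2:
            self.wait_count += 1                 # wait T1 while sending k_28.5
            if self.wait_count >= self.step2_wait:
                self.state = STEP3
        elif self.state == STEP3:
            self.state = STEP4                   # far-end loopback path enabled
        elif self.state == STEP4:
            if eols:                             # EOLS (k28.2) seen: go to END
                self.state = MISSION
        elif self.state == MISSION:
            if force_rt_rise or not heartbeat:
                self._restart()
        return self.state

    def _restart(self):
        self.state, self.health_ticks = STEP1, 0
```

A caller would drive tick once per sclk with sampled inputs, for example TxLc(step2_wait=108) for a 540 ns wait at a 5 ns sclk period; the value 108 is only an illustration of the T1(min) arithmetic.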


2.9.3 Verification Checklist 


1. Set/clear flag-LaneHealthStatus during TxLinkSync 
2. Asynchronous events flag-ForceRetraining, flag-Heartbeat 
3. Set/clear flag-TxLinkSync, flag-MissionMode 
4. Enable/disable TxLC 
5. Verify FarEndLoopback mode 


6. a. Verify bit-blasting mode 
b. Inject disparity error and invalid character error during bit-blasting mode 


2.10 Reset bring-up sequence 


The following steps are required to bring up the link after reset 
(what to do to cause each of these steps is described later): 


1. Wait for refclk stabilization time = TBD 
2. Wait for QPMA Tx PLL(s) lock (aTxClkP stabilization time) = 15 uS 


3. a. Wait for calibration time = TBD 
b. Wait for QPMA Rx PLL(s) unlock = 10 uS 
c. Wait for QPMA Rx PLL(s) lock (aRxClkN stabilization time) = 15 uS 
(AnalogBits “ABIPCCE2 Custom PLL DATASHEET” says 10 uS is enough, but we’ve seen it take slightly longer to lock) 


4. Wait for Skip-beat operation. Max time it takes = 
a. max. skipbeat operation before step_step = 5 x (31 x 5) = 775ns 
b. skipbeat operations before second_step = 5 x (31 x 5) = 775ns 
c. skipbeat operations before final_step = 5 x (31 x 5) = 775ns 
d. skipbeat operation at adjustment = 5 x (31 x 5) = 775ns 
e. Total time = 775 x 4 = 3100ns 


5. Wait for framing. Max time = 3200ns as described in section on framing. 
6. Now PMA is ready and LinkSync can begin 


7. When link enters MissionMode, invalid character error and disparity error counters may contain non-zero values. Software must initialize these registers before enabling interrupts from these registers. 
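
The wait times in this sequence can be added up as a rough budget. The sketch below is only a worked example of the arithmetic: it assumes the waits are strictly sequential (no overlap), and the two TBD times are left as parameters because the text does not give them. The names known_waits_ns and bringup_budget_ns are made up for the example.

```python
# Skip-beat figures as given in step 4: four phases of 5 x (31 x 5) ns each.
SKIPBEAT_PHASE_NS = 5 * (31 * 5)             # 775 ns
SKIPBEAT_TOTAL_NS = 4 * SKIPBEAT_PHASE_NS    # 3100 ns

# Waits whose values are stated in the list above (in nanoseconds).
known_waits_ns = {
    "tx_pll_lock": 15_000,      # step 2
    "rx_pll_unlock": 10_000,    # step 3b
    "rx_pll_lock": 15_000,      # step 3c
    "skipbeat": SKIPBEAT_TOTAL_NS,
    "framing": 3_200,           # step 5
}


def bringup_budget_ns(refclk_ns, calibration_ns):
    """Total wait before LinkSync can begin, assuming sequential waits.

    refclk_ns and calibration_ns are the two TBD values (steps 1 and 3a).
    """
    return refclk_ns + calibration_ns + sum(known_waits_ns.values())
```

With both TBD values set to zero, the stated waits alone add up to 46,300 ns, so the PLL lock and unlock times dominate the budget.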


2.10.1 When do Link Registers Get Reset 
2.10.1.1 AnalogBits QPMA Registers 


AnalogBits Internal QPMA Registers only go to their “reset values”, or more accurately their “power-on values”, when the power gets turned on. They are unaffected by either SoftReset registers or the “hard reset” reset signal coming into FL. This refers to the registers that are within the AnalogBits QPMAs, not the QSC registers. These registers are not directly accessible from the SCB bus. 


2.10.1.2 QSC Registers 


QSC Registers are reset by the “hard-reset” reset signal, but are not affected by any SoftReset registers. Most 
of the QSC Registers (with the exception of R_QscInterrupt) are used to allow indirect access from the SCB bus 
to the AnalogBits Internal QPMA Registers. 


2.10.1.3 FLT and FLR link Registers 


Flt0, Flt1, Flt2, Flr0, Flr1, Flr2 Registers get reset by the “hard-reset” reset signal, but are not affected by any SoftReset registers. The SoftReset bit of a particular link affects operation of that particular link only. 

For example, writing a 1 and then writing a 0 to R_Flr2SoftReset will cause all the control circuitry of Flr2 to 
go to their reset values including resetting all the internal state machines of link FLR-2. This will not cause any of 
the R_Flr2* registers to go to reset values. 


2.10.2 Enabling Links 


When power first comes on the Links are disabled and non-operational in several ways: QPMA units do not 
have valid Impedance Settings, QPMA units do not have valid Calibration Settings, and Link units have their 
LinkSync Routines disabled. 

After a hard-reset, or SoftReset, if configuration had previously been done during this period of power being 
ON, the QPMA units retain their prior Impedance and Calibration Settings, but the Link units have their LinkSync 
Routines disabled. 

The recommended steps to bring up links are: 


1. Determine QPMA Impedance Settings: load “factory values”, or discover them. 
2. Configure QPMA Calibration Settings: load saved values, or discover them. 
3. Initialize SkipBeat Functions. 
4. Enable Links. 


2.10.2.1 Determine QPMA Impedance Settings 


Since these settings depend only on the silicon manufacturing process of that particular ICE-9’s individual QPMA cores, and not on which slot its board is plugged into, the best settings can be determined at the factory, saved somewhere, and loaded at this time. 

Whether we choose to discover them each time we power on or they are determined “at the factory”, they can be determined by a process of SCB bus writes and reads, without the link being enabled. The status of the ICE-9s at the other end of the Links doesn’t matter. 


Ques: what about that link going to a slot with missing board? 


QPMA Impedance values are set using SCB register R_QscQpmaImpCalibration. The process of discovering correct values also uses SCB register R_QscQpmaStatus. Refer to the AnalogBits PRM manual for further details. 


2.10.2.2 Configure QPMA Calibration Settings 


QPMA Calibration Settings for a particular Link must be determined while that Link is enabled by the procedure above, and correct Impedance Settings must already be loaded. 

QPMA Calibration Settings for a particular Link must be determined by trial and error during the same time period that the ICE-9 at the other end of that Link’s fabric connection is also trying to determine its own QPMA Calibration Settings for the Link on that fabric connection. 

QPMA Calibration values are set using R_QscGo, R_QscStatus, R_QscCA, R_QscSerDatAR, R_QscSerDatT, R_QscSerDatP. 

Note that the details of “good working algorithm for trial and error”, and what values to try, are not listed here 
yet. 

When both ends of a Fabric Connection have configured good QPMA Calibration values, each end can see 
that because the LinkSync routine will make progress to later steps. This can be seen in R_FlrxRxLcStatus and 
R_FltxTxLcStatus registers by looking at fields Steps and MissionMode. 

The ICE-9’s on the two ends of a particular Fabric Connection may be beginning this step at significantly different times, differing by thousands of clocks. The early steps of the LinkSync Routines don’t mind this, don’t time out, and will have no problem waiting for the other end to start trying Calibration Values. Similarly, the algorithm for trying values and checking LcStatus will be a repeating loop, continuing long enough for the other ICE-9 to start trying values. 


2.10.2.3 Initialize SkipBeat Functions 


For the 3 FLTs, write 1, and then write 0, to bit SkipBeatEnable in R_FltxFcLaneControl, leaving field SkipBeatOffset at its reset value (unless it has been determined that another value should be used). If the QPMA PLLs are locked on stable clocks, the SkipBeat function is fast enough to complete before you can get the 0 written. 

For each lane in each of the 3 FLRs, write 1, and then write 0, to bit SkipBeatEnable in the R_FlrxLaneControl register for that lane, leaving field SkipBeatOffset at its reset value (unless it has been determined that another value should be used). 


2.10.2.4 Enable the Links 


A Link is enabled by writing 3 times to its “LcControl” register: first to enable it, then to set the ForceRT bit, then to clear the ForceRT bit. Write 0x2, then write 0x3, then write 0x2 to each of R_Flt0TxLcControl, R_Flt1TxLcControl, R_Flt2TxLcControl, R_Flr0RxLcControl, R_Flr1RxLcControl, R_Flr2RxLcControl. 

Note: All interrupt enables reset to a not-enabled state. If interrupts have been enabled since the last reset, it is desirable to disable interrupts from links which are about to enter ForceRT. Also, before enabling a link, it is desirable to clear all interrupts from that link and verify that no interrupt-generating conditions are present. 
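
The three-write enable sequence can be sketched as a short loop. This is a minimal sketch, assuming a caller-supplied scb_write(register_name, value) function; that function and the name enable_links are hypothetical and stand in for whatever SCB access method is available. The register names and the values 0x2, 0x3, 0x2 are taken from the text, with OCR damage in the names repaired.

```python
LC_CONTROL_REGS = [
    "R_Flt0TxLcControl", "R_Flt1TxLcControl", "R_Flt2TxLcControl",
    "R_Flr0RxLcControl", "R_Flr1RxLcControl", "R_Flr2RxLcControl",
]

# Enable, then set ForceRT, then clear ForceRT (values as given in the text).
ENABLE_SEQUENCE = (0x2, 0x3, 0x2)


def enable_links(scb_write, regs=LC_CONTROL_REGS):
    """Issue the three writes to each LcControl register, one link at a time.

    scb_write is a hypothetical callable taking (register_name, value).
    """
    for reg in regs:
        for value in ENABLE_SEQUENCE:
            scb_write(reg, value)
```

For a dry run, passing a function that appends to a list (for example log = []; enable_links(lambda r, v: log.append((r, v)))) records the 18 writes in order without touching hardware.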


2.11 Diagnostic Modes 
The diagnostic modes are supported to aid in lab debug of links. For correct operation in diagnostic mode, the receiver link and the transmitter link at both ends of a link must have successfully configured their respective QPMA calibration settings. Also, at most ONE of the 3 diagnostic modes described below (Near End Loopback, Far End Loopback, or Bit-Blasting Mode) should be enabled at any one time for a particular link or pair of links. 


2.11.1 NearEndLoopback Mode 


The NearEndLoopback mode of operation is supported to verify that the receiver path is connected to the transmitter path, and thus to verify the data path from FSW to transmitter link to receiver link to FSW. 

In NearEndLoopback mode, a receive data link is connected to a transmit data link and thus receiver lanes 
are disconnected from off-chip path from PHY. It is important to note that the transmitter lanes will still drive 
transmitter PHY. 

All 3 links can be simultaneously configured in NearEndLoopback Mode. 


2.11.1.1 Link-0 


For connecting FLR0 to FLT0, set LpBkNearEnd[3:0] field of R_QscQpmaControl0 and R_QscQpmaControl1 register. Also set LpBkNearEnd[0] field of R_QscQpmaControl6. 


2.11.1.2 Link-1 


For connecting FLR1 to FLT1, set LpBkNearEnd[3:0] field of R_QscQpmaControl2 and R_QscQpmaControl3 register. Also set LpBkNearEnd[1] field of R_QscQpmaControl6. 


2.11.1.3 Link-2 


For connecting FLR2 to FLT2, set LpBkNearEnd[3:0] field of R_QscQpmaControl4 and R_QscQpmaControl5 register. Also set LpBkNearEnd[3] field of R_QscQpmaControl6. 


2.11.2 FarEndLoopback Mode 


The FarEndLoopback mode of operation is supported to verify that the receiver link is looped back to the transmitter link in the SCLK domain, i.e. Flt0 and Flr0 can be connected, and/or Flt1 and Flr1 can be connected, and/or Flt2 and Flr2 can be connected. 

Do not confuse this with the so called “FarEndLoopback” used in the LinkSync Routine, which is within an 
individual FLT, FC lane in to the 8 data lanes out. 

This FarEndLoopback mode is supported to verify 8 data lanes and 1 flow control lane connectivity from receiver 
PHY to transmitter PHY in SCLK domain. The far end loopback path will bypass 10B8B decoding at the receiver 
end and 8B10B encoding at the transmitter end. 

The far end loopback mode will not invoke skipbeat function, or acquire lane health, or word synchronization. 

The far end loopback mode assumes that impedance and calibration circuit for a given link is initialized to 
correct settings and skipbeat function for that link is completed successfully. 

In FarEndLoopback mode, the mission mode signal going to FSW is de-asserted, so FSW is disconnected from the PHY, i.e. the fabric switch is bypassed. This allows the 2 remote Ice9s that are connected to each other through this local Ice9 to bring up MissionMode with each other through this 2-hop link. 

All 3 links can be simultaneously configured in FarEndLoopback Mode. 


2.11.3 Bit-Blasting Mode 


The bit-blasting mode is supported to verify link integrity from a transmitter to a receiver. For a given link, it is suggested that bit-blasting mode be invoked only after both ends of the link have entered Mission Mode. The bit-blasting mode does not attempt to invoke the skipbeat function, to acquire lane health, or to acquire word synchronization. 

Each link may be configured in Bit-Blasting Mode as follows: 


1. Verify that Link under test is in mission mode. 
This step is not mandatory, but it is strongly suggested because, for the bit-blasting function to operate correctly, the lane must have successfully completed the skipbeat function and acquired lane health. Note that bit-blasting mode does not attempt to invoke the skipbeat function, nor does it attempt to acquire lane health, nor does it acquire word synchronization. 


2. Write 0x8800 in R_FlrxBBDiag register when FLR is to enter bit-blasting mode (refer to section-2.17.18). 
Write 0x80FF in R_FltxBBDiag register when FLT is to enter bit-blasting mode (refer to section-2.16.16). 
This step serves 3 purposes: 

(a) It prevents the heartbeat counter from expiring, which prevents the hardware LinkSync routine from being invoked. 

(b) It disconnects FSW from the link by deasserting the MissionMode and DataValid signals going to FSW. 

(c) It sends NULL and ANULL characters on all driver lanes, which keeps the other end of the link in MissionMode. This allows software to manage bit-blasting mode at both ends of the link with ease. 


3. Write R_FlxxDiag register keeping BBMode set and also selecting other fields of this register. 
Note that driver lanes which are not selected to drive the bit-blasting pattern will drive PNULL (k28.5) patterns. 


4. Monitor R_FlxxDiagStatus (section-2.17.19 and section-2.16.16) register for results of bit-blasting mode. 
There are 2 bits assigned per receiver lane. One is the Sync-bit, which indicates whether the receiver lane has acquired synchronization for the configured bit-blasting pattern, and the other is the Error-bit, which indicates whether any error has been observed after synchronization was acquired. 

De-selecting a lane clears both status bits associated with that lane. Selecting the lane again makes both status bits associated with that lane valid. While in BBMode, toggling of the lane select field is permitted. 


5. Before exiting bit-blasting mode, execute above step-2. 


6. Clear R_FlxxDiag register to enter MissionMode again. 
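
Interpreting the status register in step 4 amounts to splitting a word into per-lane Sync and Error bits. The sketch below illustrates that decoding under a loudly stated assumption: the bit layout used here (lane n uses bit 2n for Sync and bit 2n+1 for Error) is hypothetical and chosen only for the example; the real layout is in the register sections referenced above. The function names are also made up.

```python
def decode_bb_status(status, num_lanes=8):
    """Split a status word into per-lane (sync, error) pairs.

    ASSUMED layout, not taken from the specification:
    lane n uses bit 2n for Sync and bit 2n+1 for Error.
    """
    lanes = []
    for n in range(num_lanes):
        sync = bool((status >> (2 * n)) & 1)
        error = bool((status >> (2 * n + 1)) & 1)
        lanes.append((sync, error))
    return lanes


def lane_passes(sync, error):
    """A lane is good when it has synchronized to the pattern and seen no error."""
    return sync and not error


def failing_lanes(status, num_lanes=8):
    """Return the indices of lanes that have not synchronized or have seen an error."""
    return [n for n, (s, e) in enumerate(decode_bb_status(status, num_lanes))
            if not lane_passes(s, e)]
```

Because de-selecting a lane clears both of its status bits, a de-selected lane would show up here as failing; a real diagnostic would need to mask out lanes that are not selected.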


2.11.4 ATE Testing of Analogbits ABICDR43 


ATE testing of the Analogbits ABICDR43 macro can be carried out at speed and in Near-End-Loopback mode. Instructions for the Near-End-Loopback test follow. 


1. Execute reset power on sequence in ICE9. 


2. Put all seven QPMA in NearEndLoopback mode by writing to R_QscQpmaControl registers. 
(LpBkNearEnd=1, ForceTxHiZ=1, ForceRxHiZ=1, and clearing rest of the bits) 


3. Initiate SkipBeat function in FLR0, FLR1, and FLR2 receiver links. Also initiate SkipBeat function in FLT0, FLT1, and FLT2 links. 
(toggle SkipBeatEnable bit in all LaneControl registers) 


4. Initiate LinkSync routine in FLR0, FLR1, and FLR2 receiver links. Also initiate LinkSync routine in FLT0, FLT1, and FLT2 links. 
(set Ena bit and then toggle ForceRT bit in all LcControl registers) 


5. Wait for 10 microsec (enough time for links to reach MissionMode). 


6. Read link status register of FLR0, FLR1, FLR2, FLT0, FLT1, and FLT2 and verify that each link (a) is not in reset, (b) is in MissionMode, (c) has heartbeat, and (d) has its Step[3:0] field clear. 


2.11.5 PLL Bypass Mode Testing of Analogbits ABICDR43 


The Analogbits ABICDR43 serdes macro (QPMA) has 5 separate PLLs. One is the TXPLL, used by the four transmitter lanes. The other four are CDRPLLs, one per receiver lane. The PLL Bypass Mode test configures all five PLLs of a QPMA in bypass mode and then validates data path connectivity from each transmitter lane to the corresponding receiver lane. When a PLL is in bypass mode, it generates an internal high speed clock that is the same as the reference clock. This test is also intended for structural testing of the serializer and deserializer of the QPMA. 

In the PLL Bypass test, once the QPMA is configured in PLL bypass mode, a data pattern of all 1’s is driven on its parallel port TxDI[9:0] and kept unchanged for 100 sclk cycles. Data from the parallel port goes through the serializer of the transmit path, loops back because of near end loopback, and is then deserialized in the receiver path on subsequent clock cycles. “TBD” sclk cycles later (but fewer than 100), it settles on the receiver parallel port RxDO[19:0]. The test checks whether the receiver port has observed all 1’s on all outputs. The test is declared partially successful if all 1’s are observed on RxDO[19:0]. 

Two more test loops as described above are carried out, the first for a data pattern of all 0’s and the next again for all 1’s. The test is declared successful only if all 3 data patterns are successfully observed on the output port. 


May 14, 2014 70 Rev 51328 


SiCortex Confidential 2.12. ERROR RECOVERY PROCEDURE 


There are 7 instances of QPMA in ICE9. The following steps are recommended for testing the PLLs of each QPMA in bypass mode. 


1. Configure TXPLL in reset by setting bits TxPllRst of R_QscQpmaImpCalibration. 

2. Configure CDRPLL in reset by setting bits CDRPLLRst of R_QscQpmaControl. 

3. Configure QPMA in power-up mode by clearing bits RxPwrDown of R_QscQpmaControl. 

4. Disable IDDQ mode of QPMA by clearing IDDQ bit of R_QscQpmaControl. 

5. Force transmit and receive macro in HiZ by setting ForceTxHiZ and ForceRxHiZ bits of R_QscQpmaControl. 

6. Enable near end loopback by setting LpBkNearEnd bits of R_QscQpmaControl. 

7. Enable high frequency transmit and receive clock by asserting TxHFClkDnB and RxHFClkDnB of R_QscQpmaTestControl. 


8. Wait for 400 sclk cycles (enough time for data input patterns to propagate from TxDI to RxDO register) and then check the PllBpStatus bits of R_QscQpmaStatus to verify the test result. 


2.12 Error recovery procedure 


Fabric link CSRs are designed to capture and hold cause and state of the error. These status registers are 
cleared by SCB master. The SCB master should clear error state and error status register(s) before reverting to 
normal mode of operation. 


2.12.1 Force Retraining 


The ForceRT bits of csr-2.16.10 and csr-2.17.10 allow the retraining sequence to be forced on the respective controller. The SCB master should force the retraining routine only after clearing all error states in the respective controller. The retraining routine can also be forced while the respective controller is in Mission Mode. 


2.13 Bring-Up Failure Points 


The link bring-up process can fail at a variety of detectable points. Here is a list of them, and what it may mean if you fail at each point. Possible example define-names are given in all-capitals for each failure. 

The first few are “whole-node”, and later ones are “for a link”. 

For a given FLR or FLT, these are listed in the same order as the actions (and checks) are done, doing first the whole-node actions, then the actions for the given FLR or FLT. So, if you fail at a particular point in this list, that means all previous actions for that link were successful. (exception: ERR_FL_INIT.CODE_<n>) 

In a failure, it would be nice to also say what the other end is (which board/node/link), or have a system-sensitive diags function like “print_link_other_end(this_board, this_node, this_link)”. 


These first 4 failure points have to do with calibrating the 7 qpmas: 


ERR_FL_TXPLL_NO_LOCK = Not all of the Tx PLLs locked. Specifically, failed to get R_QscQpmaStatus.TxClkStable in at least 1 of the 7 qpma’s, after waiting long enough after setting and then clearing the 7 R_QscQpmaImpCalibration.RxPllRst bits. 

Look at whether you’ve started chip clocks and voltages correctly. If you still get this, you probably have a bad 
Ice9 chip (bad Tx PLL). 

Note that if you are using chips that had the normal testing at the chip vendor, the packaged chips have been 
tested for this being good. The same is true for failures below where “bad chip” is likely cause. 


ERR_FL_ZCALIB_TOO_HI = The determined ZCalib transition point was above the legal range, or ZCompOp was 1 no matter how high a ZCalib was tried, on at least one qpma. 

ERR_FL_ZCALIB_TOO_LO = The determined ZCalib transition point was below the legal range, or ZCompOp was 0 no matter how low a ZCalib was tried, on at least one qpma. 

An out-of-legal-range ZCalib transition point in one qpma, in an Ice9 supplied with proper voltages and clocks, indicates a bad Ice9 chip. 

It’s nice if values, good or bad, go into a log file somewhere, in case in a later step we have excessive bit errors we're trying to diagnose. 
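
The two ZCalib failure definitions can be read as a classification over a sweep of ZCalib codes. The sketch below is one possible interpretation, not the documented algorithm: it assumes ZCompOp reads 1 below the transition point and 0 at or above it (inferred from the wording of the two error definitions), takes the transition point to be the first swept code at which ZCompOp reads 0, and treats the legal range as caller-supplied parameters because the text does not state it. The function name classify_zcalib is made up.

```python
def classify_zcalib(readings, legal_lo, legal_hi):
    """Classify one qpma's impedance sweep.

    readings: list of (zcalib_code, zcompop) pairs, sorted by increasing code.
    legal_lo, legal_hi: inclusive legal range for the transition point
                        (placeholders; the real range is not given here).
    Returns (error_name_or_None, transition_code_or_None).
    """
    ops = [op for _, op in readings]
    if all(op == 1 for op in ops):
        # ZCompOp was 1 no matter how high a ZCalib was tried.
        return ("ERR_FL_ZCALIB_TOO_HI", None)
    if all(op == 0 for op in ops):
        # ZCompOp was 0 no matter how low a ZCalib was tried.
        return ("ERR_FL_ZCALIB_TOO_LO", None)
    # Assumed definition: first code at which ZCompOp reads 0.
    transition = next(code for code, op in readings if op == 0)
    if transition > legal_hi:
        return ("ERR_FL_ZCALIB_TOO_HI", transition)
    if transition < legal_lo:
        return ("ERR_FL_ZCALIB_TOO_LO", transition)
    return (None, transition)
```

Returning the transition code alongside the error name makes it easy to write both good and bad values to a log, which helps when diagnosing excessive bit errors in a later step.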


ERR_FL_QSC_WR_FAIL = When setting qpma lanes with A T P R values, at least one lane didn’t get 
QscSuccess within a reasonable time. 

Since these are done individually, the software knows which ones failed. Repeated failure is due to bad chip, 
incorrect software sequence or Qsc addresses, or inadequate wait time. 


Once you've calibrated the 7 qpmas, you can bring up (or not bring up) each of the 3 FLTs and each of the 3 FLRs separately. If you don’t have working Ice9s at the other end of some of these 6 links, the others can still be brought up to MissionMode, and transfer packets. 

The following errors must make clear whether it’s an FLR or FLT, and which link. This information could be made part of the error-define-name, or be provided as extra information. 


ERR_FL_RXPLL_NO_UNLOCK = After resetting Rx PLLs for this link’s lanes, one or more failed to 
unlock in a reasonable time. (FLT has only 1 Rx Lane) 

You should be able to unlock PLLs no matter what the Ice9 at the other end is doing. Failure here suggests 
this Ice9 is bad, or it has bad configuration/clocks/voltages. 


ERR_FL_RXPLL_NO_LOCK_SOME = In this FLR, after unresetting the 8 data lane Rx PLLs, some failed 
to lock onto incoming signals in a reasonable time. 

ERR_FL_RXPLL_NO_LOCK_ALL = In this FLR, after unresetting the 8 data lane Rx PLLs, all 8 failed 
to lock onto incoming signals in a reasonable time. 

ERR_FL_RXPLL_NO_LOCK = In this FLT, after unresetting the control lane Rx PLL, it failed to lock 
onto incoming signal in a reasonable time. 

As shown above, for FLR I suggest writing the small extra code to differentiate between “all failed to lock” and “some failed to lock”, because “all failed to lock” strongly suggests that the Ice9 at the other end has not completed initial calibration, is in reset, or that there’s actually NO Ice9 at the other end. 
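
The "small extra code" for telling these cases apart could look like the following. This is a sketch under assumptions: the function names and the list-of-booleans input are made up for the example, and how the per-lane lock status is read from hardware is not shown.

```python
def classify_rx_pll_lock(locked, is_flr=True):
    """Map per-lane Rx PLL lock results to one of the error names above.

    locked: list of booleans, one per Rx lane
            (8 data lanes for an FLR, 1 control lane for an FLT).
    Returns None when every lane locked.
    """
    if all(locked):
        return None
    if not is_flr:
        return "ERR_FL_RXPLL_NO_LOCK"        # FLT: the single control lane failed
    if not any(locked):
        return "ERR_FL_RXPLL_NO_LOCK_ALL"    # suggests the other end is absent or not ready
    return "ERR_FL_RXPLL_NO_LOCK_SOME"


def failed_lanes(locked):
    """Return the lane indices that failed to lock, for the diagnostic message."""
    return [lane for lane, ok in enumerate(locked) if not ok]
```

Reporting failed_lanes together with the error name covers the requirement that diagnostics say which Rx lanes failed to lock.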

Rx PLL locking is the first point in the process where we are affected by the Ice9 at the other end. Rx PLL 
locking is also the first point in the process where we are affected by bad connections, serious noise on the fabric 
between chips, or improper calibration on either end. 

Failure to get Rx PLL lock is a condition worse than the “bit errors” which can cause problems in later steps. The following can interfere with Rx PLL lock, or cause bit errors and prevent LaneHealth: 

- Other end is not yet transmitting. 

- Other end has not finished calibration. 

- Other end has wrong Tx calibration. 

- This end has wrong Rx calibration. 

- This Ice9 or other-end Ice9 is in some diagnostic mode (see next section “Registers that can Prevent Link 
Coming Up”). 

- Signal not strong enough, serious reflections, or noise from outside of the differential pair (wrong Tx or Rx 
calibration). 

- Unstable Tx clock, other-end Tx PLL has not locked, or other-end sclk not stable. 

- One or both signals of differential pair have a bad connection. 

- Bad capacitor on differential pair 

- Other end is in reset. 

- Other end has power problems. 

- Other end is on a board that’s not plugged-in. 

- Bad Ice9 on either end. 

Note that you can get “false Rx PLL lock”. Reset then unreset of PLL is done to clear old false locks. If the signal on the differential pair is very bad, has data plus lots of noise, or is not being driven but the wires are picking up noise, the Rx PLL might still lock onto what it sees. 

You might check whether PLL locking is coming and going. 

If it’s an FLR, Diagnostics should say WHICH rx lanes failed to lock. 


ERR_FL_SKIPBEAT_FAIL = SkipBeat failed for at least one Rx data lane, when checked more than long enough after SkipBeat init. 

SkipBeat is aligning a divided-by-10 version of the clock formed by Rx PLL locking, with Ice9’s sclk. The most likely reason for SkipBeat failure is that you no longer have Rx PLL lock, or it comes and goes. 

A valid SkipBeat is one that completes near the specified time period. Waiting far longer and eventually seeing SbSuccess is not valid. You should restart SkipBeat AFTER acquiring a consistent Rx PLL lock, then look for success. 


ERR_FL_NO_HEALTH = No LaneHealth on at least one Rx lane, when checked a sufficient time after 
getting SkipBeatSuccess on all Rx lanes. (FLT has only 1 Rx Lane) 

The hardware will continuously try to get LaneHealth, with no software-starting of Rotator needed. 

Do you still have Rx PLL lock? Even if you do, try re-doing Rx PLL lock and SkipBeat. 

Some reasons listed under ERR_FL_RXPLL_NO_LOCK can prevent LaneHealth, even if we have Rx PLL lock. 

With a weak signal or noise causing bit errors in an ongoing manner, there may be enough edges with equal 
spacing to sustain Rx PLL lock, while causing enough character errors to prevent LaneHealth, or cause LaneHealth 
to come and go. 

With LaneHealth==0 you should see Rotator trying different values, and character error counts increasing. 

A mismatch of configured sclk frequencies between 2 Ice9’s may go unnoticed up to this point, where you are unable to get LaneHealth. 

Do we have R_QscQpmaStatus.RefClkStable at both ends? 

Is this Ice9 or other-end Ice9 in some diagnostic mode? See next section “Registers that can Prevent Link 
Coming Up”. 

If it’s an FLR, Diagnostics should say WHICH rx lanes can’t get LaneHealth. 


At this point the FLT or FLR is started into the LinkSync routine, which will try repeatedly to go through Step1, Step2, Step3, Step4, to MissionMode, unless, in uncommon cases, it gets stuck at some step. See ERR_FL_MIS_GONE below for more details on getting stuck in Steps. 


ERR_FL_SYNC_BIT_ERRS = Bit errors on at least one lane in this link, even though LinkHealth is good, 
while waiting to get to MissionMode. 

After getting LinkHealth (same as LaneHealth on all lanes), all bit error counters should be cleared. This 
condition is that you got new bit errors, after clearing the counters. 

If the number of bit errors is unchanging, wait a while and the link may still achieve MissionMode.


ERR_FL_SYNC_LOST_HEALTH = LaneHealth is false on at least one lane in this link, while waiting to 
get to MissionMode. 
The hardware will try to recover LaneHealth, and if it can, the link may still make it to MissionMode. 


ERR_FL_SYNC_TIMEOUT = MissionMode not achieved on this link after too long a time in state
scfab_link_state_syncing.

See ERR_FL_MISLGONE below for more details on different cases. 

If bit errors are not changing and LaneHealth is good, something is wrong: the link may be stuck and may need to be
restarted by Software.


The error codes below are for after MissionMode has been achieved. Using different error-defines after
MissionMode gives a little more information.

If MissionMode is lost, the hardware will repeatedly try to recover through the LinkSync routine to MissionMode, 
unless it gets stuck. 

You can read FlrxRxLcStatus or FltxTxLcStatus to see many aspects of link state in one register read:
MissionMode, LinkSync-active, LinkSync Step, whether all Rx PLLs are locked, whether all lanes have LaneHealth.

The following error cases are listed from lightest-to-heaviest badness: 


ERR_FL_MISLBIT_ERRS = After MissionMode achieved, MissionMode stays up adequately, but bit errors 
keep happening on this link. 


ERR_FL_MISLLCCOUNT_HI = MissionMode is up now, but it comes and goes too often. 


May 14, 2014 73 Rev 51328 


SiCortex Confidential CHAPTER 2. INTERNODE LINK 


This error is being reported because “too many” past reads of LcStatus.MissionMode gave 0, or because
FlrxRxLcCount or FltxTxLcCount has become “too high”, or we see that count continue to increment.


ERR_FL_MISLGONE = After MissionMode achieved, MissionMode is now false in this link. Rx PLLs are 
still locked. Rx Lanes have LaneHealth now. 

Poll for MissionMode for enough time for Syncing to happen. If still no MissionMode after reasonable time, 
Software should re-start the LinkSync routine. 

If Steps==1 and is unchanging, or repeatedly going back to Steps==1, the link may have excessive bit errors. Check
all lanes for LaneHealth, and see if BitErrors are changing.

If Steps==2, or a mix of Steps==1 and Steps==2 for a long time, we have a low but consistent rate of bit 
errors, preventing Syncing. 

In an FLR, if Steps==4 (Step3), with FlrxRxLcStatus.RxLinkSync==1 for a long time, the FLT at the other 
end should be looked at. It’s either getting lots of bit errors over the control lane from this FLR, or the control 
lane seems dead, or FLT hasn’t been started to do Syncing. 

In an FLR, if Steps==8 (Step4) for too long, we have “the Step4 hang” due to infrequent bit errors, and Software
must restart.

It’s ok for an FLT to have Steps==8 (Step4) for a fairly long time, while the FLR at the other end tries 
repeatedly to get MissionMode. But it also might be that the FLR is stuck or has a false-MissionMode. If excessive 
time passes, the FLT can try a restart of LinkSync, which can clear some conditions. If it doesn’t, the problem 
must be dealt with at the other end, in the Ice9 containing the FLR. At the other end you can see if FLR is stuck 
(requiring a restart), or one or more of the data lanes from FLT is having excessive bit errors, or seems dead. 

After MissionMode, in FLT and FLR, if no SoftReset has been done, either 

(a) MissionMode==1, LinkSync==0, Steps==0, or 

(b) MissionMode==0, LinkSync==1, Steps== one of 1, 2, 4, 8. 

Any other combination is a (rare) corrupt state, and you should restart the link, doing SoftReset first. After 
SoftReset expect MissionMode==0, LinkSync==0, Steps==0. 
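
The state check above can be written as a small helper. This is a minimal sketch in Python, assuming the MissionMode, LinkSync, and Steps fields have already been extracted from an LcStatus read. The function name and the returned labels are illustrative only and are not part of any SiCortex software.

```python
# Steps is one-hot per the text above: 1, 2, 4, 8 correspond to Step1..Step4.
VALID_SYNC_STEPS = {1, 2, 4, 8}


def classify_lc_state(mission_mode: int, link_sync: int, steps: int) -> str:
    """Classify an already-decoded (MissionMode, LinkSync, Steps) triple."""
    if mission_mode == 1 and link_sync == 0 and steps == 0:
        return "mission"           # case (a)
    if mission_mode == 0 and link_sync == 1 and steps in VALID_SYNC_STEPS:
        return "syncing"           # case (b)
    if mission_mode == 0 and link_sync == 0 and steps == 0:
        return "idle_after_reset"  # only expected right after SoftReset
    return "corrupt"               # restart the link, doing SoftReset first


if __name__ == "__main__":
    print(classify_lc_state(0, 1, 8))
```

The caller still has to know whether a SoftReset was just done: without one, "idle_after_reset" should be treated the same as "corrupt".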

ERR_FL_MISILNO_HEALTH = After MissionMode achieved, MissionMode is now false in this link. Rx 
PLLs are still locked, but LcStatus.LinkHealth==0 consistently with repeated reading, which means at least one 
Rx Lane has LaneHealth==0. 


ERR_FL_MISIL_RXPLL_NO_LOCK = After MissionMode achieved, MissionMode is now false in this link.
AllReset (or AllRxLanesReset) is 0 which means at least one Rx Lane has lost PLL lock.

ERR_FL_MISL_LRXPLL_NO_LOCK_ALL = In this FLR, after MissionMode achieved, MissionMode is now 
false in this link. AllRxLanesReset is 0, but furthermore, all 8 bits of FlrxLinkStatus.CdrPllLock are 0, consistently. 
This suggests a shut-down or removal of the Ice9 at the other end. 


ERR_FL_INIT_CODE_<n> = Link bring-up failed in one of the places where a software sanity check is 
done. This failure had nothing to do with hardware behavior. <n> is a unique number for each such place in the 
link bring-up code. 


2.14 Registers That Can Prevent Link Coming Up 


After any diagnostic or manual mode has been used, such as bit-blasting or loopback, you need to either restore all
registers to their normal values or do a hard reset before attempting normal bring-up. Bring-up software typically
doesn’t write reset values to registers it would not otherwise be writing. Even when bring-up software writes a
register, it may deliberately leave unchanged any fields within that register that it is not actively using.

Abnormal configuration in an Ice9’s registers (or the Ice9 at the other end of the Link) can prevent a Link from 
coming up to MissionMode. 

These registers (or specific fields) could prevent bring-up if badly configured: 


R_FltxSoftReset

R_FltxFcLaneControl (fields: ForceSkipBeat, SkipBeatOffset)

R_FltxAltNull

R_FltxHeartbeat (fields: Dis, Threshold)

R_FltxS2WaitTime

R_FltxMOR

R_FltxFarEndLoopback

R_FltxBBDiag


R_FlrxSoftReset

R_FlrxWSyncMode

R_FlrxHeartbeat (fields: Dis, Threshold)

R_FlrxS2WaitTime

R_FlrxLaneControl[7:0] (fields: ForceSkipBeat, SkipBeatOffset)

R_FlrxMOR[7:0]

R_FlrxBBDiag


R_QscQpmaControl[6:0] (all fields other than CDRPLLRst)

R_QscQpmaTestControl[6:0]


Also, if you can’t bring up a link because of excessive interrupts, maybe a link interrupt is inappropriately 
enabled, or wasn’t cleared. 


2.15 Common Registers and Definitions 


2.15.1 Package Attributes 
Package 


chip_fi_spec 


Attributes 


-public_rdwr_accessors


2.15.2 Definitions 
Defines 
FL 


Definition 
10’h7f STEP2_WAIT_TIME | Sleep timer value. Cycles to wait in step2. 


2.15.3 Link Symbols 


Enum 


FlSymbols 


(Code Name) 
2.15.4 Flr Events


The following events are trackable by SCB statistical event counting. 


Enum 


FlrScbEvent 


Attributes 


-descfunc 


8’h00 CYCLES | Sclk cycles. Always counts.
Other values | Reserved.


2.15.5 Flt Events 


The following events are trackable by SCB statistical event counting. 


Enum 


FltScbEvent 


Attributes 


-descfunc 


8’h00 CYCLES | Sclk cycles. Always counts.
Other values | Reserved.


2.16 FLT Registers 


2.16.1 R_FltxSoftReset
Register
R_FltxSoftReset


Attributes 


-kernel 


Address 
0x0_0000_0000 (plus base address) 


SoftReset RW Reset Link when set. When written 1, transmitter link
remains in reset state. When written 0, the transmitter
link logic comes out of the reset state.


Operation of SoftReset 


When SoftReset is asserted, the CSRs of FLTx are unaffected. However, control flops within the
FLTx module are initialized to their power-on reset values. After de-assertion of SoftReset, software must initiate
the skipbeat function on the flow control lane and then enable the transmit link.
2.16.2 R_Fltx FC Lane Control Register
Register 


R_FltxFcLaneControl


Attributes 


-kernel 


Address 


0x0_0000_0004 (plus base address) 


ClrLaneHealth RW Clear lane health.
For every transition of 0-to-1 of this bit, lane health bit
of FC lane is cleared.


ForceSkipBeat Force Skipbeat.
This bit must remain clear when SkipBeatEnable is clear.
When SkipBeatEnable is set: for every transition of 0-to-1
of this bit, RxClk offset is skipped 1-bit time. This
field is intended to be used in manual setting of RxClk.
This bit should be clear after manual setting of RxClk is
completed.


SkipBeatEnable Skip Beat Enable.
At the transition from 0-to-1, SkipBeat function is executed
once using value selected in “SkipBeatOffset”.
To initialize Skip Beat function, write 1 followed by write 0.
For manual setting of skipbeat, write 1, then use
ForceSkipBeat (above), then write this bit 0.


SkipBeatOffset SkipBeat Offset.
The receiver RxClk offset is equal to “SkipBeatOffset” bit-times
with respect to sclk.
The power-on default value is 5(hex).
This field is 4 bits wide and SkipBeatOffset can be selected
from 0(hex) to 9(hex). The values in this field are modulo-10.
To apply a new value of SkipBeatOffset, SkipBeatEnable
should be toggled.


Operating modes of Skipbeat function 


At the end of the reset sequence, the SkipBeatOffset field defaults to 0x5. It holds the offset value in bit times. With
sclk at 200 MHz, one bit time is 0.5 ns.

The SCB master can modify the SkipBeatOffset value and invoke the skipbeat function by toggling the SkipBeatEnable bit once.
This triggers the skipbeat function with the selected SkipBeatOffset value. Note that if a reset sequence is invoked
at any time during this process, SkipBeatOffset returns to its default of 0x5.

For manual SkipBeat setting, set SkipBeatEnable=1 (the SkipBeat function will run, completing faster than
you can do your next register access), then use the single-step (sample and move) manual skipbeat algorithm. To
do this, repeatedly (a) sample the state of “SbTestaRxClkN” of the R_FltxLaneStatus register, and (b) move the phase of
the receiver clock 1-bit time by toggling “ForceSkipBeat”. Note where SbTestaRxClkN transitions 0-to-1 and 1-to-0,
then do additional skips to position the offset correctly relative to those transitions. After the desired phase alignment
of the receiver clock is achieved, the SkipBeatEnable bit should be cleared.
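
The sample-and-move loop can be sketched as follows. This is an illustrative Python outline, not the actual bring-up code: the `sample` and `skip` callbacks stand in for a read of SbTestaRxClkN and a 0-to-1 toggle of ForceSkipBeat, and `FakeLane` is a made-up stand-in for hardware. It assumes that ten skips cover one full period of the divided-by-10 RxClk.

```python
from typing import Callable, List, Tuple

BIT_TIMES_PER_SCLK = 10  # RxClk is a divided-by-10 clock, so offsets wrap modulo 10


def sweep_rxclk_phase(sample: Callable[[], int],
                      skip: Callable[[], None]) -> Tuple[List[int], List[int], List[int]]:
    """Sample the clock at each of the 10 phases and report where it changes.

    Returns (samples, rising, falling); an index i in rising/falling means the
    transition was seen between skip position i-1 and i (modulo 10).
    """
    samples = []
    for _ in range(BIT_TIMES_PER_SCLK):
        samples.append(sample() & 1)  # hypothetical read of SbTestaRxClkN
        skip()                        # hypothetical ForceSkipBeat 0-to-1 toggle
    n = len(samples)
    rising = [i for i in range(n) if samples[i - 1] == 0 and samples[i] == 1]
    falling = [i for i in range(n) if samples[i - 1] == 1 and samples[i] == 0]
    return samples, rising, falling


class FakeLane:
    """Software stand-in for one lane, used only to exercise the sweep."""

    def __init__(self, waveform: List[int]) -> None:
        self.waveform = waveform
        self.phase = 0

    def sample(self) -> int:
        return self.waveform[self.phase]

    def skip(self) -> None:
        self.phase = (self.phase + 1) % len(self.waveform)


if __name__ == "__main__":
    lane = FakeLane([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])
    print(sweep_rxclk_phase(lane.sample, lane.skip))
```

How many additional skips to apply after the sweep depends on where the offset should sit relative to the two transitions, which this sketch leaves to the caller.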
2.16.3 R_Fltx Lane Status
Register 


R_FltxLaneStatus 


Attributes 


-noregtestcpu -kernel 


Address 


0x0_0000_0008 (plus base address) 


23:16 PllLock Pll Lock status. Holds lock status of 8 Tx PLLs of QPMA
module.


Rotator FC lane rotator value.


LaneHealth FC lane health status.


FcPllLock FcPLL lock status. Holds lock status of the FC PLL of the QPMA
module.


Reserved


SbTestaRxClkN Test aRxClkN signal.
It holds sampled value of aRxClkN signal from Qpma.


SbSuccess SkipBeat Success. It indicates status of last skipbeat operation.
When set, indicates that SkipBeat function has been successful.


SkipBeat Active.
When set, indicates that SkipBeat operation is active.


State of SkipBeat First search function.
When set, indicates that the First Search is completed.


SbSecondSearch R State of SkipBeat Second search function.
When set, indicates that Second search is completed.


SbFinalSearch R x State of SkipBeat Final search function.
When set, indicates that Final search is completed.


SbAdjust R x State of SkipBeat Adjust function.
When set, indicates that Adjustment is completed.


2.16.4 R_FltxInvCFc


Register 


R_FltxInvCFc 


Attributes 


-kernel -writeonemixed 


Address 


0x0_0000_000c (plus base address) 
18 Intr RW1C Invalid Character error interrupt from FltxInvCFc.
This bit is set if IntEna is set AND (Compare ==
Counter).


IntEna RW 0 Invalid Character error interrupt enable for FltxInvCFc.


Wrap Enable wrap mode for FltxInvCFc.
When set, Counter wraps on maximum count.


Compare Invalid character error counter comparator for FltxInvCFc.


Counter Invalid Character error counter for FltxInvCFc.
Counts up when invalid character error is detected on lane.
Wraps on maximum count of 8’hFF if Wrap is set.
Note: Counter does not count up in clock cycle in which
FltxInvCFc is being read or written to by SCB.


2.16.5 R_FltxDispFc
Register 


R_FltxDispFc 


Attributes 


-kernel -writeonemixed 


Address 
0x0_0000_0010 (plus base address) 


18 Intr RW1C Disparity error interrupt from FltxDispFc.
This bit is set if IntEna is set AND (Compare ==
Counter).


IntEna Disparity error interrupt enable for FltxDispFc.


Wrap Enable wrap mode for FltxDispFc.
When set, Counter wraps on maximum count.


Compare Disparity error counter comparator for FltxDispFc.


7:0 Counter RW Disparity error counter for FltxDispFc.
Counts up when disparity error is detected on lane. Wraps
on maximum count of 8’hFF if Wrap is set.
Note: Counter does not count up in clock cycle in which
FltxDispFc register is being read or written to by SCB.


2.16.6 R_FltxAltNull
Register
R_FltxAltNull


Address 
0x0_0000_0014 (plus base address) 


Definition 


4 Ena RW 1 Enable driving AltNull during IDLE cycles. When clear,
AltNull will not be driven at any setting of Rate (ANullRate).


3:0 Rate RW Rate of AltNull during IDLE cycles. When Ena (ANullEnable)
is set and this field is 0, only AltNull will be driven on IDLE cycles.
2.16.7 R_FltxHeartbeat
Register 


R_FltxHeartbeat 


Attributes 


-writeonemixed -kernel 


Address 


0x0_0000_0018 (plus base address) 


13 Intr RW1C Heartbeat error interrupt from Fltx. 

This bit is set if IntEna is set AND loss of heartbeat occurs 
in MissionMode. 

All 3 ways of losing MissionMode (force-retraining, loss- 
of-link-health, and heartbeat-timeout) are considered a 
Heartbeat Error, and will cause a Heartbeat error inter- 
rupt. Once Intr bit is set, it will need to be cleared or 
disabled to clear the main interrupt from the link. 


12 IntEna RW 0 Heartbeat error interrupt enable for Fltx.


11 Init RWS Heartbeat Init.
For every transition of 0-to-1, heartbeat counter is initialized
to its reset state once.
Note: Writing 1 to this field has side effect.


Dis Heartbeat Disable. When set, heartbeat never expires and
heartbeat function is disabled.


Threshold Heartbeat Threshold. Holds threshold value in max number
of clock cycles during which heartbeat must be detected.


2.16.8 R_FltxDriveError 


Register 


R_FltxDriveError


Address 


0x0_0000_001c (plus base address) 


23:16 | TBadChar RWS Drive Bad Character. 
On transition from 0-to-1, bad character is driven on lane. 
Each lane is assigned a bit in this field. 

15:8 TBadDisp RWS Drive Bad disparity. 
On transition from 0-to-1, bad disparity is driven on lane. 
Each lane is assigned a bit in this field. 


7:0 | TCharError RW Create transmit Error. A bit is assigned to each lane. 
A bit is set to reflect created error on lane as per either 
TBadChar or TBadDisp field. 
In system level testing, error created on lane(s) should 
also be detected by corresponding 8B10B decoder lanes 
of receiver chip. 
2.16.9 R_FltxTxLcStatus
Register 


R_FltxTxLcStatus


Attributes 


-noregtestcpu -kernel 


Address 


0x0_0000_0020 (plus base address) 


AllReset R x Holds status of flag-AllLanesReset.
Note that this status field holds status of one receiver lane.
When set, it indicates that PLL is locked and lane has its
reset de-asserted.


LinkHealth R x Holds status of flag-LinkHealth.


TxLinkSync R x Holds status of flag-TxLinkSync.
5 MissionMode R x Holds status of flag-MissionMode.
Heartbeat R x Holds status of flag-Heartbeat.
Steps R x Holds status of Step-1,2,3,4 of TxLc.


2.16.10 R_FltxTxLcControl 
Register 


R_FltxTxLcControl


Attributes 


-kernel 


Address 


0x0_0000_0024 (plus base address) 


1 Ena RWS Enable TxLinkSync. When set, hardware execution rou- 
tine TxLinkSync is enabled. After setting this bit, write 
ForceRT bit to initiate TxLinkSync. 


ForceRT RWS Force Retraining or execute TxLinkSync routine.
A 0-to-1 transition of this bit forces re-entry to the
TxLinkSync routine.


2.16.11 R_FltxTxLcCount 


Register 


R_FltxTxLcCount


Attributes 


-kernel 
Address
0x0_0000_0028 (plus base address) 


7:0 | TxLcCount RW TxLcCount.
Counter holding the number of times hardware routine TxLc is
invoked. The counter will count up when TxLc goes from
Step-1 to Step-2. Counter will wrap on maximum count.
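
Because the counter is 8 bits wide and wraps, software that polls it to decide whether the link is re-syncing too often has to compare readings modulo 256. A minimal sketch in Python, assuming two successive readings of the 7:0 field are available; the function name is illustrative.

```python
LC_COUNT_MODULUS = 256  # TxLcCount / RxLcCount occupy bits 7:0


def lc_count_delta(previous: int, current: int) -> int:
    """Number of LinkSync entries between two reads, allowing for one wrap."""
    return (current - previous) % LC_COUNT_MODULUS


if __name__ == "__main__":
    print(lc_count_delta(250, 3))  # counter wrapped between the two reads
```

If more than 255 entries can occur between polls the delta is ambiguous, so the polling interval should be short relative to the expected re-sync rate.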


2.16.12 R_FltxS2WaitTime
Register


R_FltxS2WaitTime


Address 
0x0_0000_002c (plus base address) 


Step2WaitTime RW 0x7F Step2 sleep timer value.
Cycles to wait in step2. Default value is set at 7F(hex),
i.e. 127 x 5 ns = 635 ns.


What is Step2WaitTime? 


The Step2WaitTime is the time required to ensure that the Link between two ICE9s is filled with NULL characters
only. The default setting of 0x7f is initialized at power-on, which equals a waiting time of 635 ns in a system where
SCLK is operating at 200 MHz. To change the Step2WaitTime setting after power-on, (a) put FLT into SoftReset,
then (b) write the new value into S2WaitTime, and then (c) remove SoftReset. It is strongly suggested to
avoid depositing any value lower than 0x0f as Step2WaitTime, because such a lower value may not be sufficient to
ensure that the Link between two ICE9s is filled with NULL characters.
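
The arithmetic behind the 635 ns figure is one sclk period per count. A small Python sketch, assuming the wait is exactly Step2WaitTime sclk cycles; the function names are illustrative.

```python
MIN_RECOMMENDED_STEP2_WAIT = 0x0F  # values below this are discouraged above


def step2_wait_ns(step2_wait_time: int, sclk_hz: float = 200e6) -> float:
    """Convert a Step2WaitTime setting (in sclk cycles) to nanoseconds."""
    return step2_wait_time * 1e9 / sclk_hz


def is_recommended(step2_wait_time: int) -> bool:
    """True if the setting is not below the suggested minimum of 0x0f."""
    return step2_wait_time >= MIN_RECOMMENDED_STEP2_WAIT


if __name__ == "__main__":
    print(step2_wait_ns(0x7F))  # default setting at 200 MHz
```

At a different sclk frequency the same register value gives a different wait, so the setting may need to be re-derived if sclk is not 200 MHz.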


2.16.13 Fltx Manual Override Rotator (MOR) 
Register 
R_FltxMOR


Address 
0x0_0000_0030 (plus base address) 


ManualOverrideRotator Manual override or Force Rotator Setting for flow control
lane.
When set, rotator function in framer is disabled and the rotator
value specified in RotatorSetting is forced.


RotatorSetting Rotator Setting.
Note that Rotator settings from 0x9 to 0xF are assumed
to be at value of 0x9.


How the Manual Rotator Override function works


Manual Override Rotator (MOR) function may be activated if automatic Linksync routine fails and failure 
points to rotator function. 


1. To activate MOR, select RotatorSetting (between 0x0 through 0x9) and set ManualOverrideRotator bit. 


2. Next, initiate Linksync routine by accessing R_FltxTxLcControl register.


3. During discovery of valid RotatorSetting, Rotator field of R_FltxLaneStatus is invalid because it captures
rotator setting in automatic Linksync routine. However, LaneHealth field of R_FltxLaneStatus will indicate
valid status of lane health. In MOR, if LaneHealth is true then the correct rotator setting has been acquired;
otherwise other values of rotator setting should be tried.


4. During MOR, all fields of R_FltxTxLcStatus are valid and content of this register should be used to find
correct rotator setting.


5. After discovering correct rotator setting, re-initiate Linksync routine by accessing R_FltxTxLcControl register.
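
The search in steps 1 to 5 amounts to trying each rotator setting in turn until LaneHealth is seen. The following Python outline is a sketch under assumptions: the three callbacks are hypothetical wrappers around the register accesses named above (R_FltxMOR, R_FltxTxLcControl, R_FltxLaneStatus) and do not correspond to any known SiCortex API.

```python
from typing import Callable, Optional


def find_rotator_setting(apply_setting: Callable[[int], None],
                         restart_linksync: Callable[[], None],
                         lane_healthy: Callable[[], bool]) -> Optional[int]:
    """Try RotatorSetting 0x0..0x9 and return the first one giving LaneHealth."""
    for setting in range(0x0, 0xA):
        apply_setting(setting)   # write RotatorSetting with ManualOverrideRotator set
        restart_linksync()       # re-initiate Linksync via R_FltxTxLcControl
        if lane_healthy():       # LaneHealth field of R_FltxLaneStatus
            return setting
    return None                  # no setting produced LaneHealth


if __name__ == "__main__":
    state = {"setting": None}
    found = find_rotator_setting(lambda s: state.update(setting=s),
                                 lambda: None,
                                 lambda: state["setting"] == 7)
    print(found)
```

Real code would also wait a suitable time between restarting Linksync and reading LaneHealth; that delay is not specified here and is omitted from the sketch.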


2.16.14 R_FltxFarEndLoopback 
Register 


R_FltxFarEndLoopback 


Address 


0x0_0000_0034 (plus base address) 


FarEndLpBk RW Far End Loopback Mode. 
When set, it indicates that Far end loopback mode is ac- 
tive. 
When set, puts both, the FLT and FLR, with the same 
link number into FarEndLoopback mode as described in 
section-2.11.2. 


2.16.15 R_FltxBBDiag 


Register 


R_FltxBBDiag


Address 


0x0_0000_0040 (plus base address) 


FcPattern RW Receive bit-blasting pattern type on flow control lane.
(a) 0x0 - repeat k28.5 (PNULL) 31 times and k28.0 (ANULL) (once)
(b) 0x1 - PNULL (k28.5)
(c) 0x2 - D10.2 (0x4A)
(d) 0x3 - D24.3 (0x78)
(e) 0x4 - IKJPAT pattern to stimulate inter-symbol interference
(ISI) in ac-coupled system.
Loop of 484 Characters:
D30.3 (0x7E) 167 times
D20.3 (0x74) once
D30.3 (0x7E) once
D11.5 (0xAB) once
D21.5 (0xB5) 51 times
D30.2 (0x5E) once
D10.2 (0x4A) once
D30.3 (0x7E) 4 times
D30.7 (0xFE) once
D20.7, D11.7 (0xF4EB) 128 times


FcLaneSel Flow control lane select for bit-blasting pattern.
When set, flow control lane is receiving pattern selected by
FcPattern field.


10:8 | TxPattern RW Transmitter bit-blasting pattern type.
(a) 0x0 - repeat driving k28.5 (PNULL) 31 times and
k28.0 (ANULL) (once)
(b) 0x1 - drive PNULL (k28.5)
(c) 0x2 - drive D10.2 (0x4A)
(d) 0x3 - drive D24.3 (0x78)
(e) 0x4 - drive IKJPAT pattern to stimulate inter-symbol
interference (ISI) in ac-coupled system.
Loop of 484 Characters:
D30.3 (0x7E) 167 times
D20.3 (0x74) once
D30.3 (0x7E) once
D11.5 (0xAB) once
D21.5 (0xB5) 51 times
D30.2 (0x5E) once
D10.2 (0x4A) once
D30.3 (0x7E) 4 times
D30.7 (0xFE) once
D20.7, D11.7 (0xF4EB) 128 times


7:0 | TxLaneSel RW Transmitter lane select for bit-blasting pattern.
A bit is assigned to each of 8 lanes.
When BBEnable is set and a lane’s bit is set, that lane is enabled to
drive the bit-blasting pattern selected by the TxPattern field.
When BBEnable is set and a lane’s bit is clear, that lane drives
the PNULL (k28.5) pattern.
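
The run lengths listed for pattern 0x4 can be cross-checked against the stated loop length of 484 characters. The Python sketch below builds the data-byte sequence from the table above; the names `IKJPAT_RUNS`, `build_ikjpat`, and `d_code` are illustrative, and `d_code` uses the standard 8B/10B naming in which Dx.y denotes the byte (y << 5) | x.

```python
def d_code(x: int, y: int) -> int:
    """Byte value of 8B/10B data character Dx.y (x = low 5 bits, y = high 3 bits)."""
    return (y << 5) | x


# (characters in the run, repeat count), in the order listed in the table.
IKJPAT_RUNS = [
    ((0x7E,), 167),       # D30.3
    ((0x74,), 1),         # D20.3
    ((0x7E,), 1),         # D30.3
    ((0xAB,), 1),         # D11.5
    ((0xB5,), 51),        # D21.5
    ((0x5E,), 1),         # D30.2
    ((0x4A,), 1),         # D10.2
    ((0x7E,), 4),         # D30.3
    ((0xFE,), 1),         # D30.7
    ((0xF4, 0xEB), 128),  # D20.7, D11.7 pair
]


def build_ikjpat() -> bytes:
    """Expand the run table into one loop of the pattern (pre-encoding bytes)."""
    out = bytearray()
    for chars, repeat in IKJPAT_RUNS:
        out.extend(bytes(chars) * repeat)
    return bytes(out)


if __name__ == "__main__":
    print(len(build_ikjpat()))
```

The runs add up as 167 + 1 + 1 + 1 + 51 + 1 + 1 + 4 + 1 + 2 x 128 = 484, which agrees with the loop length given in the field description.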


2.16.16 Fltx_BBDiagStatus 


Register 
R_FltxBBDiagStatus
Attributes


-noregtestcpu 


Address 


0x0_0000_0044 (plus base address) 


1 | FcLaneSync R x Lane synchronization status of flow control lane.
This bit will be set in BBMode, when flow control lane is
selected to check for FcbbPattern and it finds FcbbPattern.
This bit will remain clear in BBMode, if flow control
LaneSelect bit is clear.


FcBBError R x Bit Blasting error on flow control lane.
This bit will be set in BBMode, if LaneSync is set and
then flow control lane detects BBPattern error. Otherwise
this bit will remain clear. This bit will also remain clear
if flow control LaneSelect bit is clear.


2.17 FLR Registers 
2.17.1 R_FlrxSoftReset


Register 
R_FlrxSoftReset 


Attributes 


-kernel 


Address 
0x0_0000_0000 (plus base address) 


SoftReset Reset Link when set. When written 1, receiver link remains
in reset state. When written 0, the receiver link
logic comes out of the reset state.
FlrCsr module remains unaffected by SoftReset.


Operation of SoftReset 


When SoftReset is asserted, the CSRs of FLRx are unaffected. However, control flops within the
FLRx module are initialized to their power-on reset values. After de-assertion of SoftReset, software must initiate
the skipbeat function on its receiver lanes and then enable the receiver link.


2.17.2 R_FlrxLinkStatus
Register
R_FlrxLinkStatus


Attributes 


-noregtestcpu -kernel 
Address


0x0_0000_0004 (plus base address) 


DrivenBadFcChar RW Drove bad FC character and created transmit Error.
When set, it reflects that transmit error was created
by setting either DriveBadChar or DriveBadDisp of
R_FlrxLinkControl.

In system level testing, error created on flow control lane
should also be detected by corresponding 8B10B decoder
lane of receiver chip.


PllLock Lock status of TxPLL of FC lane.


CdrPllLock Lock status of CdrPLL. Holds lock status of CDR PLL of
eight receiver PLLs in QPMA.


2.17.3 R_FlrxLinkControl 


Register 
R_FlrxLinkControl 


Address 


0x0_0000_0008 (plus base address) 


Definition 


1 DriveBadChar | RWS Drive bad character. 
On transition from 0-to-1, one bad or invalid character is 
driven on FC lane. 


DriveBadDisp | RWS Drive bad disparity. 
On transition from 0-to-1, one character is driven with 
disparity error on FC lane. 


2.17.4 R_FlrxRotator 


Register 


R_FlrxRotator


Address 


0x0_0000_000c (plus base address) 


Definition 


31:0 Rotator R x Rotator Status. Rotator status of eight lanes. Each lane 
is assigned 4-bit wide field. 
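
A 32-bit read of this register can be split into eight per-lane rotator values. The Python sketch below assumes that lane 0 occupies the least significant 4 bits (bits 3:0), lane 1 bits 7:4, and so on; the field description does not state the lane ordering, so treat that as an assumption to verify.

```python
from typing import List

LANES = 8
ROTATOR_BITS = 4  # each lane has a 4-bit rotator field


def unpack_rotator(reg_value: int) -> List[int]:
    """Split a 32-bit Rotator value into per-lane values (lane 0 first).

    Assumes lane N occupies bits [4N+3 : 4N]; the ordering is an assumption.
    """
    mask = (1 << ROTATOR_BITS) - 1
    return [(reg_value >> (ROTATOR_BITS * lane)) & mask for lane in range(LANES)]


if __name__ == "__main__":
    print(unpack_rotator(0x76543210))
```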


2.17.5 R_FlrxRxLcStatus 


Register 
R_FlrxRxLcStatus


Attributes 


-noregtestcpu -kernel 
Address


0x0_0000_0010 (plus base address) 


AllRxLanesReset R x Holds status of flag-AllRxLanesReset.
When set, it indicates that PLL is locked and eight lanes
have their reset signals de-asserted.


LinkHealth R x Holds status of flag-LinkHealth.


RxLinkSync R x Holds status of flag-RxLinkSync.
5 MissionMode R x Holds status of flag-MissionMode.
Heartbeat R x Holds status of flag-Heartbeat.
Steps R x Holds status of Step-1,2,3,4 of RxLc.


2.17.6 R_FlrxLaneHealth 
Register 


R_FlrxLaneHealth


Attributes 


-kernel 


Address 


0x0_0000_0014 (plus base address) 


15:8 | ClrLaneHealth RW Clear lane health.
On transition from 0-to-1, lane’s health bit is cleared.
Each lane is assigned a bit in this field.


LaneHealth R x Lane health status. Each lane is assigned 1-bit field.


2.17.7 R_FlrxWSyncMode 
Register 


R_FlrxWSyncMode 


Address 


0x0_0000_0018 (plus base address) 




27:20 | MwsEnab RW Manual WordSync Enable.
This field is 8 bits wide and one bit is assigned to each
lane.
When MwsEnab[x] is set, corresponding 2-bit wide lane
selector setting of wsync multiplexer is forced.
When MwsEnab[x] is clear, corresponding wsync multiplexer
select setting is set by automatic wordsync operation.


Mws Manual Wordsync setting.
This field is 16 bits wide and has 8 groups.
Each group is 2 bits wide and assigned to a lane. Bits[5:4]
are assigned to Lane-0, bits[7:6] are assigned to Lane-1,
and so on.
For each lane, the 2-bit field holds the select value for the 4-to-1 wsync
multiplexer.


Force Wsync cycle.
A 0-to-1 transition of this bit forces the RxLinkSync
routine to enter Step4.

DisVerror RW Disable Wsync pattern verification error. If this bit is set 
then pattern verification logic in step-4 does not detect 
any errors. Setting of this bit allows successful completion 
of step-4. 


How Manual Override Wordsync mode works


Manual Wordsync operation can be invoked if skipbeat and rotator functions are working but unexplained CRC
errors (without disparity errors and/or invalid character errors) are observed. In Manual Wordsync Override
mode, the link is forced to enter MissionMode so that characters from all 8 lanes are sent to the fabric switch. Though each
lane may be individually forced into Manual Override Wordsync mode (MwsEnab), the preferred method is to force all
8 lanes into Manual Override Wordsync mode by setting all bits of the MwsEnab field and by selecting each individual lane’s
2-bit wordsync setting (Mws).


When MwsEnab is set, once the link enters Step4, it stays in Step4 for the time it takes to execute the tasks
of Step4 (approximately 4 microseconds) and is then forced to enter MissionMode. After the link has entered MissionMode,
the AutoSetting field of R_FlrxWSyncStatus is invalid but the rest of the bits of R_FlrxWSyncStatus are valid.
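
Building the MwsEnab and Mws portions of an R_FlrxWSyncMode value can be sketched as below. This Python example is illustrative: it uses the bit positions given in the field table (MwsEnab at 27:20, Lane-0 select at bits 5:4, Lane-1 at 7:6, and so on), leaves all other fields of the register at zero, and the function name is made up.

```python
from typing import Sequence

MWS_ENAB_SHIFT = 20  # MwsEnab occupies bits 27:20, one bit per lane
MWS_SHIFT = 4        # Lane-0 select is bits 5:4, Lane-1 is bits 7:6, ...


def pack_wsync_mode(selects: Sequence[int], enab_mask: int = 0xFF) -> int:
    """Pack eight 2-bit wordsync selects and an enable mask into one value."""
    if len(selects) != 8:
        raise ValueError("expected one select value per lane (8 lanes)")
    value = (enab_mask & 0xFF) << MWS_ENAB_SHIFT
    for lane, sel in enumerate(selects):
        if not 0 <= sel <= 3:
            raise ValueError("wordsync select must fit in 2 bits")
        value |= sel << (MWS_SHIFT + 2 * lane)
    return value


if __name__ == "__main__":
    print(hex(pack_wsync_mode([3, 0, 0, 0, 0, 0, 0, 0], enab_mask=0x01)))
```

Since the register also holds the Force Wsync and DisVerror bits, real code would normally merge this value into a read-modify-write rather than write it directly.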


2.17.8 R_FlrxWSyncStatus 
Register 


R_FlrxWSyncStatus 


Address 


0x0_0000_001c (plus base address) 
19:4 | AutoSetting Wsync Auto Setting.
This field is 16 bits wide and has 8 groups. Each group is
2 bits wide and assigned to one lane. Lane-0 has bits[1:0],
Lane-1 has [3:2], and so on.
Reads wsync multiplexer settings in wsync auto operation.
Reading of this field is invalid if corresponding lane’s
MwsEnab bit is set in R_FlrxWSyncMode register.


cane heise of wordsync operation. Set to 1 when wordsync is 
ee 

VError R This status bit is set when verify cycle detects an error.


This status bit is set when wsync cycle is active.


Verify R This status bit is set when verify cycle is active during
Wsync.


2.17.9 R_FlrxHeartbeat 


Register 


R_FlrxHeartbeat 


Attributes 


-writeonemixed -kernel 


Address 
0x0_0000_0020 (plus base address) 


13 Intr RW1C Heartbeat error interrupt from Flrx. 

This bit is set if IntEna is set AND loss of heartbeat occurs 
in MissionMode. 

All 3 ways of losing MissionMode (force-retraining, loss- 
of-link-health, and heartbeat-timeout) are considered a 
Heartbeat Error, and will cause a Heartbeat error inter- 
rupt. Once Intr bit is set, it will need to be cleared or 
disabled to clear the main interrupt from the link. 


12 IntEna RW 0 Heartbeat error interrupt enable for Flrx.


11 Init RWS Heartbeat Init.
For every transition of 0-to-1, heartbeat counter is initialized
to its reset state once.
Note: Writing 1 to this field has side effect.


Dis Heartbeat Disable. When set, heartbeat never expires and
heartbeat function is disabled.


Threshold Heartbeat Threshold. Holds threshold value in max number
of clock cycles during which heartbeat must be detected.


2.17.10 R_FlrxRxLcControl 


Register 
R_FlrxRxLcControl 
Attributes


-kernel 


Address 


0x0_0000_0024 (plus base address) 


1 Ena RWS Enable RxLinkSync. When set, hardware execution rou- 
tine RxLinkSync is enabled. After setting this bit, write 
ForceRT bit to initiate RxLinkSync. 


ForceRT RWS Force Retraining or execute RxLinkSync routine. Setting
of this bit will force re-entry to RxLinkSync routine.


2.17.11 R_FlrxRxLcCount


Register 
R_FlrxRxLcCount 


Attributes 


-kernel 


Address 
0x0_0000_0028 (plus base address) 


7:0 | RxLcCount RW RxLcCount.
Counter holding the number of times hardware routine RxLc
is invoked. The counter will count up when RxLc goes
from Step-1 to Step-2. Counter will wrap on maximum
count.


2.17.12 R_FlrxS2WaitTime 


Register 
R_FlrxS2WaitTime 


Address 
0x0_0000_002c (plus base address) 


Step2WaitTime RW 0x7F Step2 sleep timer value.
Cycles to wait in step2. Default value is set at 7F(hex),
i.e. 127 x 5 ns = 635 ns.


What is Step2WaitTime? 


The Step2WaitTime is the time required to ensure that the Link between two ICE9s is filled with NULL characters
only. The default setting of 0x7f is initialized at power-on, which equals a waiting time of 635 ns in a system where
SCLK is operating at 200 MHz. To change the Step2WaitTime setting after power-on, (a) put FLR into SoftReset,
then (b) write the new value into S2WaitTime, and then (c) remove SoftReset. It is strongly suggested to
avoid depositing any value lower than 0x0f as Step2WaitTime, because such a lower value may not be sufficient to
ensure that the Link between two ICE9s is filled with NULL characters.
2.17.13 Flrx Lane Invalid Character Error Register
Register 
R_FlrxLaneInvC[7:0] 


Attributes 


-kernel -writeonemixed 


Address 


0x0_0000_0030 - 0x0_0000_004c (plus base address) 


18 Intr RW1C Invalid Character error interrupt from FlrxLaneInvC. 
This bit is set if IntEna is set AND (Compare == 
Counter). 
IntEna RW Invalid Character error interrupt enable for
FlrxLaneInvC.


16 Wrap Enable wrap mode for FlrxLaneInvC.


Compare Invalid character error counter comparator for
FlrxLaneInvC.


Counter Invalid character error counter for FlrxLaneInvC.
Counts up when invalid character error is detected on lane.
Wraps on maximum count of 8’hFF if Wrap is set.
Note: Counter does not count up when FlrxLaneInvC
register is being read or written to by SCB.
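
The address range 0x30 to 0x4c covers eight registers, which implies one 4-byte register per lane. A Python sketch of the per-lane address calculation follows; it assumes index 0 sits at the lowest address, and the constant and function names are illustrative.

```python
FLRX_LANE_INVC_OFFSET = 0x30  # first R_FlrxLaneInvC register (plus base address)
LANE_REG_STRIDE = 4           # 0x30..0x4c holds 8 registers, so 4 bytes each


def lane_reg_address(base: int, first_offset: int, lane: int) -> int:
    """Address of the per-lane register for `lane` (0..7) in an indexed bank."""
    if not 0 <= lane <= 7:
        raise ValueError("lane must be in 0..7")
    return base + first_offset + LANE_REG_STRIDE * lane


if __name__ == "__main__":
    print(hex(lane_reg_address(0x0, FLRX_LANE_INVC_OFFSET, 7)))
```

The same helper applies to the other [7:0] register banks in this section by passing their first offset.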


2.17.14 Flrx Lane Disparity Error Register 


Register 
R_FlrxLaneDisp[7:0] 


Attributes 


-kernel -writeonemixed 


Address 
0x0_0000_0050 - 0x0_0000_006c (plus base address) 


18 Intr RW1C Disparity error interrupt from FlrxLaneDisp. 
This bit is set if IntEna is set AND (Compare == 
Counter). 


IntEna Disparity error interrupt enable for FlrxLaneDisp.


Wrap Enable wrap mode for FlrxLaneDisp.
When set, Counter wraps on maximum count.


Compare Disparity error counter comparator for FlrxLaneDisp.


7:0 Counter RW Disparity error counter for FlrxLaneDisp.
Counts up when disparity error is detected on lane. Wraps
on maximum count of 8’hFF if Wrap is set.
Note: Counter does not count up when FlrxLaneDisp register
is being read or written to by SCB.
2.17.15 R_Flrx Lane Status Register
Register


R_FlrxLaneStatus[7:0]


Attributes 


-noregtestcpu -kernel 


Address 


0x0_0000_0070 - 0x0_0000_008c (plus base address) 


SbTestaRxClkN R x Test aRxClkN signal.
It holds the sampled value of the aRxClkN signal from the QPMA.

SkipBeat Success. Indicates the status of the last SkipBeat operation.
When set, indicates that the SkipBeat function was successful.

SkipBeat Active.
When set, indicates that a SkipBeat operation is active.

State of SkipBeat First search function.
When set, indicates that the First search is completed.

State of SkipBeat Second search function.
When set, indicates that the Second search is completed.

State of SkipBeat Final search function.
When set, indicates that the Final search is completed.

State of SkipBeat Adjust function.
When set, indicates that Adjustment is completed.


2.17.16 Flrx Lane Control Register
Register


R_FlrxLaneControl[7:0]


Attributes


-kernel


Address


0x0_0000_0090 - 0x0_0000_00ac (plus base address)




7 ForceSkipBeat RW Force SkipBeat.
This bit must remain clear when SkipBeatEnable is clear.
When SkipBeatEnable is set: for every 0-to-1 transition of this bit, the RxClk offset is skipped by 1 bit time.
This field is intended to be used for manual setting of RxClk.
This bit should be cleared after manual setting of RxClk is completed.


Reserved


SkipBeatEnable Skip Beat Enable.
At the transition from 0 to 1, the SkipBeat function is executed once using the value selected in “SkipBeatOffset”.
To initialize Skip Beat function, write 1 followed by write
For manual setting of SkipBeat, write 1, then use ForceSkipBeat (above), then write this bit to 0.


SkipBeat Offset.
The receiver RxClk offset is equal to “SkipBeatOffset” bit times with respect to sclk.
The power-on default value is 5(hex).
This field is 4 bits wide and SkipBeatOffset can be selected from 0(hex) to 9(hex). The values in this field are modulo-10.
To apply a new value of SkipBeatOffset, SkipBeatEnable should be toggled.


Operating modes of Skipbeat function 


At the end of the reset sequence, the SkipBeatOffset field defaults to 0x5. It holds the offset value in bit times. At
200 MHz sclk, the bit time is 0.5 ns.

The SCB master can modify the SkipBeatOffset value and invoke the SkipBeat function by toggling the SkipBeatEnable
bit once. This triggers the SkipBeat function with the selected SkipBeatOffset value. Note that if the reset sequence
is invoked at any time during this process, SkipBeatOffset returns to its default of 0x5.

For manual SkipBeat setting, set SkipBeatEnable=1 (the SkipBeat function will run, completing faster than
you can do your next register access), then use the single-step (sample and move) manual SkipBeat algorithm. To
do this, repeatedly (a) sample the state of “SbTestaRxClkN” in the R_FlrxLaneStatus register, and (b) move the phase
of the receiver clock by 1 bit time by toggling “ForceSkipBeat”. Note where SbTestaRxClkN transitions 0-to-1 and
1-to-0, then do additional skips to position the offset correctly relative to those transitions. After the desired phase
alignment of the receiver clock is achieved, the SkipBeatEnable bit should be cleared.
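The sample-and-move loop described above can be sketched against a software stand-in for one lane. Everything here is hypothetical scaffolding: FakeLane, its method names, and its waveform (sampled high for 5 of the 10 bit positions) are assumptions made for illustration, and real accesses would go through the SCB. The sketch only locates the two transitions; choosing the final offset relative to them is left to the caller.

```python
class FakeLane:
    """Stand-in for one FLR lane (hypothetical; not a model of the real hardware)."""
    BITS_PER_SCLK = 10  # 0.5 ns bit time within a 5 ns sclk period

    def __init__(self, phase=0, high_start=3):
        self.phase = phase
        self.high_start = high_start
        self.skip_beat_enable = 0
        self._force = 0

    def write_force_skip_beat(self, value):
        # A 0-to-1 transition skips the RxClk offset by one bit time.
        if self.skip_beat_enable and value == 1 and self._force == 0:
            self.phase = (self.phase + 1) % self.BITS_PER_SCLK
        self._force = value

    def read_sb_test_a_rx_clk_n(self):
        # ASSUMED waveform: sampled high for 5 of the 10 phases.
        return 1 if (self.phase - self.high_start) % self.BITS_PER_SCLK < 5 else 0


def sweep_transitions(lane):
    """Step through one sclk period; return (rising, falling) skip counts."""
    lane.skip_beat_enable = 1
    rising = falling = None
    prev = lane.read_sb_test_a_rx_clk_n()
    for skips in range(1, lane.BITS_PER_SCLK + 1):
        lane.write_force_skip_beat(1)  # 0-to-1: move one bit time
        lane.write_force_skip_beat(0)  # re-arm for the next skip
        cur = lane.read_sb_test_a_rx_clk_n()
        if prev == 0 and cur == 1:
            rising = skips
        elif prev == 1 and cur == 0:
            falling = skips
        prev = cur
    return rising, falling


lane = FakeLane()
print(sweep_transitions(lane))
lane.skip_beat_enable = 0  # clear SkipBeatEnable once alignment is done
```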


2.17.17 Flrx Manual Override Rotator (MOR) 
Register 
R_FlrxMOR[7:0]


Address 
0x0_0000_00b0 - 0x0_0000_00cc (plus base address) 


ManualOverrideRotator Manual override, or force rotator setting.
When set, the rotator function in the framer is disabled and the rotator value specified in RotatorSetting is forced.


RotatorSetting Rotator setting.
Note that rotator settings from 0x9 to 0xF are assumed to be at a value of 0x9.


How the Manual Override Rotator function works

The Manual Override Rotator (MOR) function may be activated if the automatic Linksync routine fails and the failure
points to the rotator function.


1. To activate MOR on the failing lane, select a RotatorSetting (0x0 through 0x9) and set the ManualOverrideRotator bit.


2. Next, initiate the Linksync routine by accessing the R_FlrxRxLcControl register.


3. During discovery of a valid RotatorSetting, the value in R_FlrxRotator is invalid because it captures the rotator
setting of the automatic Linksync routine. However, the LaneHealth field of R_FlrxRxLcStatus indicates the valid
status of lane health. In MOR, if LaneHealth is true then the correct rotator setting has been acquired; otherwise
other values of the rotator setting should be tried.


4. During MOR, all fields of R_FlrxRxLcStatus are valid, and the content of this register should be used to find the
correct rotator setting.


5. After discovering the correct rotator setting, re-initiate the Linksync routine by accessing the R_FlrxRxLcControl register.
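The search implied by steps 1 through 3 can be sketched as a simple loop over the ten rotator settings. The callback try_setting is hypothetical: it stands for whatever code writes RotatorSetting, sets ManualOverrideRotator, initiates Linksync, and returns the resulting LaneHealth as a boolean.

```python
def find_rotator_setting(try_setting, settings=range(0x0, 0xA)):
    """Return the first RotatorSetting for which LaneHealth is true, else None.

    try_setting(value) is a caller-supplied (hypothetical) function that
    performs steps 1 and 2 for one value and reports LaneHealth.
    """
    for value in settings:
        if try_setting(value):
            return value
    return None


# Usage with a fake lane that is only healthy at rotator setting 6.
print(find_rotator_setting(lambda value: value == 6))
```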


2.17.18 R_FlrxBBDiag 


Register 


R_FlrxBBDiag 


Address 


0x0_0000_00d0 (plus base address) 
Flow control bit-blasting pattern type.
(a) 0x0 - repeat driving k28.5 (PNULL) 31 times and k28.0 (ANULL) once
(b) 0x1 - drive PNULL (k28.5)
(c) 0x2 - drive D10.2 (0x4A)
(d) 0x4 - drive D24.3 (0x78)
(e) 0x8 - drive IKJPAT pattern to stimulate inter-symbol interference (ISI) in ac-coupled system.
Loop of 484 characters:
D30.3 (0x7E) 167 times
D20.3 (0x74) once
D30.3 (0x7E) once
D11.5 (0xAB) once
D21.5 (0xB5) 51 times
D30.2 (0x5E) once
D10.2 (0x4A) once
D30.3 (0x7E) 4 times
D30.7 (0xFE) once
D20.7, D11.7 (0xF4EB) 128 times

Flow control lane select for bit-blasting pattern.
When BBEnab and this bit is set, the flow control lane transmits the pattern selected by the FcPattern field.
When BBEnable and this bit is clear, the flow control lane transmits the PNULL (k28.5) pattern.

Receiver bit-blasting pattern type.
(a) 0x0 - repeat k28.5 (PNULL) 31 times and k28.0 (ANULL) once
(b) 0x1 - PNULL (k28.5)
(c) 0x2 - D10.2 (0x4A)
(d) 0x3 - D24.3 (0x78)
(e) 0x4 - IKJPAT pattern to stimulate inter-symbol interference (ISI) in ac-coupled system.
Loop of 484 characters:
D30.3 (0x7E) 167 times
D20.3 (0x74) once
D30.3 (0x7E) once
D11.5 (0xAB) once
D21.5 (0xB5) 51 times
D30.2 (0x5E) once
D10.2 (0x4A) once
D30.3 (0x7E) 4 times
D30.7 (0xFE) once
D20.7, D11.7 (0xF4EB) 128 times

Receiver Lane Select for bit-blasting patterns.
A bit is assigned to each of the 8 lanes.
When set, the selected lane is enabled to check the bit-blasting pattern type selected by the RxPattern field.
When clear, the selected lane does not check for RxbbPattern.


2.17.19 Flrx_BBDiagStatus


Register
R_FlrxBBDiagStatus


Attributes


-noregtestcpu


Address
0x0_0000_00d4 (plus base address)


15:8 | RxLaneSync R x Lane synchronization status. A bit is assigned to each 
lane. 
This bit will be set in BBMode, when corresponding lane 
is selected to check for RxbbPattern and selected lane 
finds RxbbPattern. 
This bit will remain clear in BBMode, if corresponding 
lane is not selected. 


7:0 | RxBBError R x Bit Blasting error. A bit is assigned to each lane. 
This bit will be set in BBMode, if corresponding LaneSync 
is set and then if selected lane detects BBPattern error. 
Otherwise this bit will remain clear. This bit will also 
remain clear if corresponding LaneSelect bit is clear. 


2.18 FLR/FLT Register Allocation 


This chapter instantiates the three copies of the FLR and FLT registers. 


2.18.1 Flr0
Register
R_Flr0* : R_Flrx*


Address
0xE_0D00_0000-0xE_0DFF_FFFF


2.18.2 Flr1
Register
R_Flr1* : R_Flrx*


Address
0xE_1D00_0000-0xE_1DFF_FFFF


2.18.3 Flr2
Register
R_Flr2* : R_Flrx*


Address
0xE_2D00_0000-0xE_2DFF_FFFF


2.18.4 Flt0
Register
R_Flt0* : R_Fltx*


Address


0xE_3D00_0000-0xE_3DFF_FFFF


2.18.5 Flt1
Register


R_Flt1* : R_Fltx*


Address


0xE_4D00_0000-0xE_4DFF_FFFF


2.18.6 Flt2
Register


R_Flt2* : R_Fltx*


Address


0xE_5D00_0000-0xE_5DFF_FFFF
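To illustrate how the block base addresses above combine with the "plus base address" offsets given for the per-lane Flrx registers, the sketch below computes the address of one lane's register. It assumes that the base is the start of the block's address range and that per-lane registers are spaced 4 bytes apart (inferred from ranges such as 0x30 to 0x4c for eight lanes); the function and table names are illustrative.

```python
FLR_BASE = {0: 0xE_0D00_0000, 1: 0xE_1D00_0000, 2: 0xE_2D00_0000}
FLT_BASE = {0: 0xE_3D00_0000, 1: 0xE_4D00_0000, 2: 0xE_5D00_0000}


def lane_reg_addr(block_base, first_offset, lane, stride=4):
    """Address of a per-lane register: base + offset of lane 0 + stride * lane.

    ASSUMPTION: a 4-byte stride between lanes, inferred from the address ranges.
    """
    if not 0 <= lane <= 7:
        raise ValueError("lane must be in 0..7")
    return block_base + first_offset + stride * lane


# R_Flr0LaneInvC[7], using the 0x30 offset of R_FlrxLaneInvC[0]
print(hex(lane_reg_addr(FLR_BASE[0], 0x30, 7)))
```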


Vregs_End_Of_Decl 


2.19 Quad Serdes Physical Media Access (QPMA) 


The AnalogBits SERDES physical media access macro, referred to as QPMA, has quad transmit and receive
lanes. The transmit and receive lanes within a QPMA are identified as X, Y, Z, and W. The QPMA has quad clock
and data recovery (CDR) logic for the quad receiver lanes. The QPMA generates four separate receiver clocks, one
for each receiver lane. Each receiver clock is in phase with its incoming data stream. The QPMA has one PLL which
generates clocks for the four transmit lanes. The QPMA also has the calibration and impedance control circuits for
the quad transmitter and receiver channels.

Each ICE9 has 3 fabric links and each fabric link has 9 serdes lanes. Hence each ICE9 uses 7 QPMAs to
construct 3 fabric links. Figure 2.9 shows the placement of the 7 QPMAs in ICE9. They are numbered from 0
through 6. QPMA6 supports the flow control lanes for each of the three links. QPMA6 has one pair of
unused transmit and receive lanes.
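The QPMA count follows directly from the lane arithmetic. A short worked check, in which the constant names are illustrative:

```python
LINKS = 3
LANES_PER_LINK = 9   # 8 data lanes + 1 flow control lane
LANES_PER_QPMA = 4   # X, Y, Z, and W

lanes_needed = LINKS * LANES_PER_LINK                     # 27 lanes in each direction
qpmas_needed = -(-lanes_needed // LANES_PER_QPMA)         # ceiling division
unused_lanes = qpmas_needed * LANES_PER_QPMA - lanes_needed

print(lanes_needed, qpmas_needed, unused_lanes)
```

Seven quads provide 28 lanes, one more than the 27 required, which accounts for the single unused transmit and receive pair in QPMA6.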

The following table shows the lane assignments in QPMA for each of the three fabric links.


Fabric Link | QPMA | QPMA Lane


[Figure: floorplan of the seven QPMA macros along the chip edge, in the order QPMA5, QPMA4, QPMA6, QPMA3, QPMA2, QPMA1, QPMA0, together with the FLR/FLT blocks and the Fabric Switch (sclk).]
Key in AnalogBits macro: S = serializer, D = de-serializer, T = transmit PHY, R = receive PHY, P = transmit PLL, C = impedance and calibration circuit for quad.
Macro sizes: bump pitch @ 210um, quad = 1680um x 630um, total = 11,760um x 630um; bump pitch @ 200um, quad = 1600um x 600um, total = 11,200um x 600um; bump pitch @ 180um, quad = 1440um x 540um, total = 10,080um x 540um.
Annotations: Each bus is 80 bits wide, (64+3)+(8+2)+3=80. FLTs have no flops, only combinational gates. Each FLT drives (10 mission_mode + 4 calibration control) signals.
Notes: 1. AnalogBits macro takes 270um area height. 2. Corner rules of DRC/LVS on top level must be observed.


Figure 2.9: QPMA Placement in ICE9


2.19.1 Calibration and Impedance Control of the driver and Receiver 


The QPMA has individual transmitter driver impedance control circuitry. The QPMA also has individual
receiver impedance calibration circuitry. The details of the driver and receiver control are described in the AnalogBits
document “Serdes PMA Programmer’s Reference Manual”.

ICE9 has a driver and receiver handshake interface with the QPMA, called the Quad Serdes Control
(QSC). The QSC has 5 registers which are accessible through the SCB: QscGo, QscCA, QscSerDatAR,
QscSerDatT, and QscSerDatP. QscCA holds the address of the target driver or receiver. The
QscSerDat* registers hold the calibration values for the targeted driver and receiver. When 1 is written to the QscGo
register, the QSC loads the impedance and calibration values into the target driver or receiver. The QSC also has the
QscStatus register, which holds the status of the handshake as described in section 2.20.


2.19.2 Verification Checklist: 


1. Reset sequence 
2. Transmitter channel impedance calibration 


3. Receiver channel impedance calibration 


2.20 Quad Serdes Control (QSC) Registers 


2.20.1 R_QscGo 
Register 


R_QscGo 


Attributes 


-kernel 


Address 


0xE_6D00_0000 


QscGo RWS Write QSC register for the specified QuadSerdes.
On the 0-to-1 transition, the targeted QuadSerdes register is written. The target QuadSerdes register is
specified by the R_QscCA register and the data values are specified in the R_QscSerDatAR, R_QscSerDatT, and
R_QscSerDatP registers.


2.20.2 R_QscStatus 


Register 


R_QscStatus 


Attributes 


-kernel 


Address 


0xE_6D00_0004 
InvalidQadr RW1C Invalid QuadSerdes address.
This bit is set if the QSC was invoked with at least one invalid Quad Serdes address since the previous clearing of this field.


InvalidSubQadr Invalid Sub Quad Address.
This bit is set if the QSC was invoked with at least one invalid Sub Quad address since the previous clearing of this field.

InvalidTarget Invalid Target.
This bit is set if the QSC was invoked with at least one invalid target address since the previous clearing of this field.


Reserved


QSC Success.
This bit holds the status of the previous QSC transaction.
When set, it indicates that the previous QSC transaction was a success. When clear, it indicates that the previous
QSC transaction was a failure.

Busy QSC busy.
Holds the status of the QSC controller. When set, it indicates that the QSC controller is busy.


2.20.3 R_QscCA 


Register 


R_QscCA 


Attributes 


-kernel 


Address 


0xE_6D00_0010 


11:8 QscAdr RW QSC Address.
Holds the address of the QSC. There are a total of 7 QSCs.
A value greater than 6(hex) in this field is an invalid QSC address.


QscSubAdr RW QSC Sub Address.
Holds the sub address of the QSC. There are a total of 4 sub addresses in each QSC. The encodings of this field are
as below:
8(hex) - Sub address W
4(hex) - Sub address X
2(hex) - Sub address Y
1(hex) - Sub address Z
All other encodings (a total of 12) are invalid QSC sub addresses.


3:0 QscTarget RW QSC Target.
Holds the target of the calibration transaction.
1(hex) - Tx Driver
2(hex) - Rx Receiver
All other encodings (a total of 14) are invalid targets.
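A value for R_QscCA can be assembled from the three fields as sketched below. Two points are assumptions rather than statements from this section: QscSubAdr is placed in bits 7:4 (the only gap between QscAdr at 11:8 and QscTarget at 3:0), and the sub address encodings are treated as one-hot (8, 4, 2, 1 for W, X, Y, Z). The helper and table names are illustrative.

```python
SUB_ADDR = {"W": 0x8, "X": 0x4, "Y": 0x2, "Z": 0x1}  # ASSUMED one-hot encodings
TARGET = {"tx": 0x1, "rx": 0x2}                      # Tx Driver, Rx Receiver


def encode_qsc_ca(qsc_adr, lane, target):
    """Pack QscAdr (11:8), QscSubAdr (ASSUMED 7:4), and QscTarget (3:0)."""
    if not 0 <= qsc_adr <= 6:
        raise ValueError("QscAdr must be in 0..6")
    return (qsc_adr << 8) | (SUB_ADDR[lane] << 4) | TARGET[target]


# Tx driver of the W lane of QPMA6
print(hex(encode_qsc_ca(6, "W", "tx")))
```

Rejecting out-of-range values in software avoids setting the InvalidQadr, InvalidSubQadr, or InvalidTarget status bits in R_QscStatus.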
2.20.4 R_QscSerDatAR
Register
R_QscSerDatAR


Address
0xE_6D00_0014


17:0 | ASerDatAR | RW aSerDatAR Register.
Holds the 18-bit value to be written to either the aSerDatA or the aSerDatR register of the QSC.
Refer to (a) section 7.3 of the Serdes Programmer’s Reference Manual for transmitter output driver settings and
(b) section 8.3 of the Serdes Programmer’s Reference Manual for receiver settings.


2.20.5 R_QscSerDatT
Register


R_QscSerDatT


Address
0xE_6D00_0018


17:0 | ASerDatT RW aSerDatT Register.
Holds the 18-bit value to be written to the aSerDatT register of the QSC.
Refer to section 7.3 of the Serdes Programmer’s Reference Manual for transmitter output driver settings.


2.20.6 R_QscSerDatP
Register


R_QscSerDatP


Address
0xE_6D00_001c


17:0 | ASerDatP RW aSerDatP Register.
Holds the 18-bit value to be written to the aSerDatP register of the QSC.
Refer to section 7.3 of the Serdes Programmer’s Reference Manual for transmitter output driver settings.


2.20.7 R_QscQpmaStatus 


Register 
R_QscQpmaStatus[6:0]


Attributes


-noregtestcpu -kernel
Address


0xE_6D00_0020 - 0xE_6D00_0038


16:13 PllBpStatus R x PLL Bypass Test Status.
When set, indicates that the PLL Bypass Test was successful for the [W,X,Y,Z] lanes.

12 ZCompOp R x Impedance calibrator result.
When 1, Z < nominal.
When 0, Z > nominal.


11:8 CdrDiagOut R x CDRDiagOut.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Refer to “Serdes PMA Programmer’s Reference Manual” for a detailed explanation.


RefClkStable RefClk (or sclk) stable.
When set, indicates that sclk is stable. This signal is generated by CLK_GEN.


Set 1024 sclk cycles after ATxClkStable is asserted.
When set, it indicates that TxClk is stable.

ATxClkStable R x Set when the transmitter clocks are up and stable for the [W,X,Y,Z] lanes.


3 ARxClkStableW Set when the receiver W-lane clock is bit-locked to incoming data.

2 ARxClkStableX Set when the receiver X-lane clock is bit-locked to incoming data.

1 ARxClkStableY Set when the receiver Y-lane clock is bit-locked to incoming data.

0 ARxClkStableZ R x Set when the receiver Z-lane clock is bit-locked to incoming data.


2.20.8 R_QscQpmaImpCalibration
Register


R_QscQpmaImpCalibration


Attributes 


-kernel 


Address 


0xE_6D00_003c 
TxPLL Reset. A bit is assigned to each QPMA. Thus bit 12 controls QPMA0 and bit 18 controls QPMA6.
When set, shuts down the TxPLL and bypasses RefClk (sclk) to the internal high frequency (1 GHz) clock.

Selects which circuitry is calibrated. Asynchronous signal.
The encoded values are as below:
0 - Calibration shutdown
1 - Calib Tx
2 - Calib Rx
3 - invalid

Reserved.

Calibration control value.


2.20.9 R_QscQpmaControl


Register


R_QscQpmaControl[6:0]


Attributes


-kernel


Address


0xE_6D00_0040 - 0xE_6D00_0058
31:28 | CDRPLLRst RW CDRPLL Reset. Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
When set, shuts down the CDRPLL and bypasses RefClk (sclk) to the internal high frequency (1 GHz) clock.


27:24 | RxPwrDown RW Receiver power down. Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
When set, the receiver is in power-down mode. This signal does not put the CDRPLL in power-down mode.


23:20 IDDQ IDDQ mode. Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
When 1, the lane is configured for IDDQ mode; otherwise it is in normal operation.
Refer to “Serdes PMA Programmer’s Reference Manual” for details.

19:16 RxTest RxTest mode control override for the CDR feedback loop. Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Refer to “Serdes PMA Programmer’s Reference Manual” for details.

15:12 | CDRDiagIn CDRDiagIn.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Refer to “Serdes PMA Programmer’s Reference Manual” for details.


11:8 ForceTxHiZ Force driver in HiZ. Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Its assertion takes precedence over the SerTxCtr[1:0] load operation.

7:4 ForceRxHiZ Force receiver in HiZ. Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Its assertion takes precedence over the SerRxCtr[1:0] load operation.

3:0 LpBkNearEnd NearEndLoopback mode enable. Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
When set, the lane is in NearEndLoopback mode.
This bit should be 0 for normal mode of operation.


2.20.10 R_QscQpmaTestControl


Register


R_QscQpmaTestControl[6:0]


Address 


0xE_6D00_0060 - 0xE_6D00_0078 
15:12 | TxHFClkDnB RW 0xF Tx HFClk.
Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Refer to “Serdes PMA Programmer’s Reference Manual” for details.

11:8 | RxHFClkDnB RW 0xF Rx HFClk.
Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Refer to “Serdes PMA Programmer’s Reference Manual” for details.


7:4 RxFDIp1 RW RxFDIp1.
Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Refer to “Serdes PMA Programmer’s Reference Manual” for details.

3:0 RxFDIp0 RW RxFDIp0.
Asynchronous signal.
Lanes are individually controlled. Lane assignment is [W,X,Y,Z].
Refer to “Serdes PMA Programmer’s Reference Manual” for details.


2.20.11 R_QscInterrupt 


Register 


R_QscInterrupt 


Attributes 


-kernel -writeonemixed 


Address 


0xE_6D00_0080 
7 Intr RW1C Interrupt signal from Fl. The interrupt signal goes to the CSW and is named fl_csw_Int_sa.
This bit is set if IntEnab is set AND any one or more of the FltIntr or FlrIntr bits are set.
To clear this bit, first clear all FltIntr and FlrIntr bits that are set, by clearing the Intr bits in the registers
listed below, then write 1 to this bit.


6 IntEnab RW 0 Overall interrupt enable from the Fl module.


5:3 FltIntr R x Fltx interrupt status.
A bit is assigned to capture the interrupt status of each Flt module 2, 1, and 0.
The FltIntr bit is set for a specific Fltx when one or more of the Intr bits in R_FltxInvCFc, R_FltxDispFc, or
R_FltxHeartbeat are set.

2:0 FlrIntr R x Flrx interrupt status.
A bit is assigned to capture the interrupt status of each Flr module 2, 1, and 0.
The FlrIntr bit is set for a specific Flrx when one or more of the Intr bits in R_FlrxHeartbeat, R_FlrxLaneInvC[7:0],
or R_FlrxLaneDisp[7:0] are set.
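The aggregation described above can be sketched as a function that builds the register value from the per-module interrupt states. It assumes that module 0 occupies the lowest bit of each 3-bit field, and it models only the setting condition of Intr, not its sticky RW1C behavior; the function name is illustrative.

```python
def qsc_interrupt_value(flt_intrs, flr_intrs, int_enab):
    """Build an R_QscInterrupt value from per-module interrupt states.

    flt_intrs / flr_intrs: three booleans each, indexed by module number.
    ASSUMPTION: module 0 maps to the lowest bit of its field.
    """
    flt_field = sum(1 << i for i, b in enumerate(flt_intrs) if b)  # bits 5:3
    flr_field = sum(1 << i for i, b in enumerate(flr_intrs) if b)  # bits 2:0
    intr = int_enab and (flt_field != 0 or flr_field != 0)         # bit 7
    return (int(intr) << 7) | (int(int_enab) << 6) | (flt_field << 3) | flr_field


# Flt1 has a pending interrupt and the overall enable is set.
print(hex(qsc_interrupt_value([False, True, False], [False, False, False], True)))
```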


2.20.12 Qsc TxBBDiag 


Register 


R_QscTxBBDiag 


Address 


0xE_6D00_0090 
Bit Blasting mode enable.
When reset, the 10-bit code sent to this transmitter will remain at logic level low, or at value 0x0.

Transmitter bit-blasting pattern type.
(a) 0x0 - repeat driving k28.5 (PNULL) 31 times and k28.0 (ANULL) once
(b) 0x1 - drive PNULL (k28.5)
(c) 0x2 - drive D10.2 (0x4A)
(d) 0x3 - drive D24.3 (0x78)
(e) 0x4 - drive IKJPAT pattern to stimulate inter-symbol interference (ISI) in ac-coupled system.
Loop of 484 characters:
D30.3 (0x7E) 167 times
D20.3 (0x74) once
D30.3 (0x7E) once
D11.5 (0xAB) once
D21.5 (0xB5) 51 times
D30.2 (0x5E) once
D10.2 (0x4A) once
D30.3 (0x7E) 4 times
D30.7 (0xFE) once
D20.7, D11.7 (0xF4EB) 128 times

Transmitter lane select for bit-blasting pattern.
When BBEnable and this bit is set, the lane is enabled to drive the bit-blasting pattern selected by the TxPattern field.
When BBEnable and this bit is clear, the selected lane drives the PNULL (k28.5) pattern.


2.20.13 Qsc Lane Status Register
Register


R_QscLaneStatus


Attributes


-noregtestcpu


Address


0xE_6D00_0094
Rotator R x Lane rotator value.


CdrPllLock CdrPLL lock state. Holds the lock status of the CDRPLL of the unused receiver in the QPMA.


Test aRxClkN signal.
It holds the sampled value of the aRxClkN signal from the QPMA.


SbSuccess SkipBeat Success. Indicates the status of the last SkipBeat operation.
When set, indicates that the SkipBeat function was successful.

SkipBeat Active.
When set, indicates that a SkipBeat operation is active.

State of SkipBeat First search function.
When set, indicates that the First search is completed.


SbSecondSearch R x State of SkipBeat Second search function.
When set, indicates that the Second search is completed.


SbFinalSearch R x State of SkipBeat Final search function.
When set, indicates that the Final search is completed.


SbAdjust R x State of SkipBeat Adjust function.
When set, indicates that Adjustment is completed.


2.20.14 Qsc Lane Control Register 


Register 


R_QscLaneControl 


Attributes 


-kernel 


Address 


0xE_6D00_0098 
ClrLaneHealth RW Clear lane health.
For every 0-to-1 transition of this bit, the lane health bit of the lane is cleared.


Force SkipBeat.
For every 0-to-1 transition of this bit, the RxClk offset is skipped by 1 bit time. This field is intended to be used
for manual setting of RxClk. This bit should be cleared after manual setting of RxClk is completed.

Skip Beat Enable.
At the transition from 0 to 1, the SkipBeat function is executed once using the value selected in “SkipBeatOffset”.

SkipBeat Offset.
The receiver RxClk offset is equal to “SkipBeatOffset” bit times with respect to sclk.
The power-on default value is 5(hex).
This field is 4 bits wide and SkipBeatOffset can be selected from 0(hex) to 9(hex). The values in this field are modulo-10.
To apply a new value of SkipBeatOffset, SkipBeatEnable should be toggled.


2.20.15 R_QscRxBBDiag
Register
R_QscRxBBDiag


Address
0xE_6D00_00a0


BBEnab RW 0 Bit Blasting mode enable.


3:1 | RxPattern RW Receiver bit-blasting pattern type.
(a) 0x0 - repeat k28.5 (PNULL) 31 times and k28.0 (ANULL) once
(b) 0x1 - PNULL (k28.5)
(c) 0x2 - D10.2 (0x4A)
(d) 0x3 - D24.3 (0x78)
(e) 0x4 - IKJPAT pattern to stimulate inter-symbol interference (ISI) in ac-coupled system.
Loop of 484 characters:
D30.3 (0x7E) 167 times
D20.3 (0x74) once
D30.3 (0x7E) once
D11.5 (0xAB) once
D21.5 (0xB5) 51 times
D30.2 (0x5E) once
D10.2 (0x4A) once
D30.3 (0x7E) 4 times
D30.7 (0xFE) once
D20.7, D11.7 (0xF4EB) 128 times


RxLaneSel RW Receiver Lane Select for bit-blasting patterns.
When set, the selected lane is enabled to check the bit-blasting pattern type selected by the RxPattern field.
When clear, the selected lane does not check for RxbbPattern.
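The run lengths listed for the IKJPAT loop can be checked by expanding them into the byte sequence. The sketch below is illustrative: the table and function names are not from the register definition, and it assumes the D20.7, D11.7 pair is sent in the order listed.

```python
# (bytes in the run, repeat count), in the order given in the pattern description
IKJPAT_RUNS = [
    ((0x7E,), 167),       # D30.3
    ((0x74,), 1),         # D20.3
    ((0x7E,), 1),         # D30.3
    ((0xAB,), 1),         # D11.5
    ((0xB5,), 51),        # D21.5
    ((0x5E,), 1),         # D30.2
    ((0x4A,), 1),         # D10.2
    ((0x7E,), 4),         # D30.3
    ((0xFE,), 1),         # D30.7
    ((0xF4, 0xEB), 128),  # D20.7, D11.7 (ASSUMED to be sent in this order)
]


def ikjpat_loop():
    """Expand the run-length table into one loop of the pattern."""
    out = []
    for chars, repeat in IKJPAT_RUNS:
        out.extend(chars * repeat)
    return bytes(out)


print(len(ikjpat_loop()))
```

The expansion gives 228 single characters plus 128 two-character pairs, which matches the stated loop length of 484 characters.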
2.20.16 R_QscRxBBDiagStatus
Register
R_QscRxBBDiagStatus


Attributes


-noregtestcpu


Address
0xE_6D00_00a4


1 RxLaneSync R x Lane synchronization status.
This bit will be set in BBMode if the lane is selected to check for RxbbPattern and finds RxbbPattern.
This bit will remain clear if BBMode is not selected.


RxBBError R x Bit Blasting error.
This bit will be set in BBMode if RxLaneSync is set and the selected lane then detects a BBPattern error. Otherwise
this bit will remain clear. This bit will also remain clear if the RxLaneSelect bit is clear.


2.21 Link Unit Implementation Interface 


Following sub-sections list handshake signals to and from link unit. 


2.21.1 Interrupt Interface 


The “fl_csw_Int_sa” is interrupt generating output signal from link unit. All Link interrupts are communicated 
by asserting this output. Refer to CSR section-2.20.11 for further details on interrupts from link unit. 


2.21.2 Serial Configuration Bus Interface 


The fabric link registers are accessible through the SCB interface. To connect to the SCB, a module must
instantiate an SCB slave module and connect it to a global SCB chain. The input is connected to chaini_scbs_dat_sr
and the output is connected to scbs_chaino_dat_sr. The SCB bus and the SCB slave module are documented in the
serial configuration bus chapter.


2.21.3 Differential Drivers and Receivers 


A link unit drives 27 serial signals on 27 differential drivers and receives 27 serial signals on 27 differential
receivers. The differential drivers and receivers are part of the AnalogBits QPMA and are described in the AnalogBits
document “Serdes PMA Programmer’s Reference Manual”.


2.21.4 Fabric Switch Interface 
The following 8B10B characters are used on the serial lanes.
1. k28.0 (byte = 8’h1c) - alternate NULL character
2. k28.1 (byte = 8’h3c) - SOLS, start of LinkSync, used by the LinkSync hardware execution routine
3. k28.2 (byte = 8’h5c) - EOLS, end of LinkSync, used by the LinkSync hardware execution routine
4. k28.3 (byte = 8’h7c) - SOP, start of packet, used during MissionMode operation
5. k28.4 (byte = 8’h9c) - EOP, end of packet, used during MissionMode operation
6. k28.5 (byte = 8’hbc) - NULL or IDLE character
7. The following 8B10B control characters are reserved. These 6 characters are verified as part of the data pattern
verification cycle in the LinkSync hardware execution routine, but they are not used in the SiCortex serial lane protocol:
k28.6, k28.7, k23.7, k27.7, k29.7, k30.7
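The byte values in the list above follow the usual 8B10B naming convention, in which Kx.y (or Dx.y) denotes the byte with y in bits 7:5 and x in bits 4:0. A short sketch that reproduces them (the function name and the short labels are illustrative):

```python
def code_group_byte(x, y):
    """8-bit value of 8B10B code group Kx.y or Dx.y: y in bits 7:5, x in bits 4:0."""
    return (y << 5) | x


for y, name in enumerate(["ANULL", "SOLS", "EOLS", "SOP", "EOP", "NULL"]):
    print(f"k28.{y} = 0x{code_group_byte(28, y):02x}  {name}")
```

The same helper gives the data bytes used in the bit-blasting patterns, for example D10.2 = 0x4A and D24.3 = 0x78.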


The Figure-2.10 shows handshake signals between fabric switch and link interface at the transmitter and at the 
receiver. The figure assumes that ICE9-A is the link transmitter of data packets and link receiver of control packets 
and ICE9-B is the link receiver of data packets and link transmitter of control packets. 


[Figure: ICE9-A (FabricSwitch, TransmitLink) connected to ICE9-B (ReceiveLink, FabricSwitch) by a DataLink (8 lanes) and a ControlLink (1 lane).
ICE9-A FabricSwitch to TransmitLink: fsw_fltx_OutDat_s2a (64-bit), fsw_fltx_SoP_s2a, fsw_fltx_EoP_s2a, fsw_fltx_Idle_s2a, fsw_fltx_DatVal_s2a.
ICE9-A TransmitLink to FabricSwitch: fltx_fsw_MissionMode, fltx_fsw_CtlDat_s0a (8-bit), fltx_fsw_NewCtlPkt_s0a, fltx_fsw_DatVal_s0a.
ICE9-B ReceiveLink to FabricSwitch: flrx_fsw_InDat_s0a (64-bit), flrx_fsw_SoP_s0a, flrx_fsw_EoP_s0a, flrx_fsw_Idle_s0a, flrx_fsw_DatVal_s0a, flrx_fsw_MissionMode.
ICE9-B FabricSwitch to ReceiveLink: fsw_flrx_CtlDat_s3a (8-bit), fsw_flrx_NewCtlPkt_s3a, fsw_flrx_DatVal_s3a.]


Figure 2.10: Handshake Signals


The data packets and IDLE packets, which are collectively referred to as data packets, begin at the FabricSwitch of
ICE9-A and travel from the FabricSwitch to the TransmitLink of ICE9-A on a 64-bit wide data bus (fsw_fltx_OutDat_s2a).
The TransmitLink transmits the 64-bit data on an 8-lane wide serial link which is connected to the ReceiveLink of
ICE9-B. The ReceiveLink of ICE9-B transfers the 64-bit wide data bus (flrx_fsw_InDat_s0a) and handshake signals to
the FabricSwitch of ICE9-B.

Correspondingly, the control packets begin at the FabricSwitch of ICE9-B and travel from the FabricSwitch to the
ReceiveLink of ICE9-B on an 8-bit wide data bus (fsw_flrx_CtlDat_s3a). The link interface of ICE9-B then transmits
control and/or data characters on a single serial lane which is connected to ICE9-A. The link interface of ICE9-A
transfers the 8-bit data bus (fltx_fsw_CtlDat_s0a) and handshake signals to the FabricSwitch of ICE9-A.


2.21.5 The transmitter Handshake Ports 


Signal Name 
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fltx_fsw_MissionMode | TransmitLink | FabricSwitch | When clear: 
(a) TransmitLink is down and not available for transmit- 
ting data or control packets. 
(b) FabricSwitch must not assert fsw_fitx_Dat VaLs2a sig- 
nal. If fsw_fltx_DatVal_s2a signal is asserted and if 
fltx_fsw_MissionMode is clear, then transmitter link may 
drive unpredictable characters on serial link causing un- 
predictable behavior at the receiver. 
When set: 
(a) TransmitLink is up, available for transmitting data. 
(b) If fsw_fitx_DatVal_s2a signal is clear then Trans- 
mitLink will drive either NULL (k28.5) or alternate 
NULL (k28.0) characters on all 8 lanes. An alternate 
NULL (k28.0) character will be driven every 8th cycle of 
fsw_fltx_Dat Val_s2a signal remaining deasserted. 
(c) If fsw_fltx_DatVal_s2a is set then TransmitLink 
will transmit either control or data characters on se- 
rial lanes. The status encodings of fsw_fltx_SoP_s2a, 
fsw_fltx_EoP_s2a, and fsw_fltx_Idle_s2a are valid if among 
those 3 signals, condition of mutual exclusion is met and 
the rest of the bytes in FORD are data bytes 


fsw_fltx_DatVal_s2a | FabricSwitch | TransmitLink | Data Valid signal.
When set: Indicates that control signals fsw_fltx_SoP_s2a,
fsw_fltx_EoP_s2a, fsw_fltx_Idle_s2a, and data bus
fsw_fltx_OutDat_s2a[63:0] are valid.
When clear: Control signals and data bus are invalid.


fsw_fltx_OutDat_s2a | FabricSwitch | TransmitLink | 64-bit data bus, FORD = {Byte7,Byte6,..Byte0}


fsw_fltx_SoP_s2a | FabricSwitch | TransmitLink | Start of packet; ignores BYTE0 and sends control
character k28.3 on lane 0. Start of packet is ignored if either
fsw_fltx_EoP_s2a or fsw_fltx_Idle_s2a is also set.

fsw_fltx_EoP_s2a | FabricSwitch | TransmitLink | End of packet; ignores BYTE0 and sends control
character k28.4 on lane 0. End of packet is ignored if either
fsw_fltx_SoP_s2a or fsw_fltx_Idle_s2a is also set.

fsw_fltx_Idle_s2a | FabricSwitch | TransmitLink | Idle packet; ignores BYTE0 and sends control character
k28.5 or k28.0 on lane 0, as configured in CSR: 2.16.6.
Idle packet is ignored if either fsw_fltx_SoP_s2a or
fsw_fltx_EoP_s2a is also set.


fltx_fsw_CtlDat_s0a | TransmitLink | FabricSwitch | 8-bit databus


fltx_fsw_NewCtlPkt_s0a | TransmitLink | FabricSwitch | When set, indicates marker for new control packet and
databus fltx_fsw_CtlDat_s0a = 8’h7c


fltx_fsw_DatVal_s0a | TransmitLink | FabricSwitch | When fltx_fsw_MissionMode is clear:
(a) fltx_fsw_DatVal_s0a will remain deasserted.
(b) fltx_fsw_CtlDat_s0a and fltx_fsw_NewCtlPkt_s0a are
invalid and should be ignored.
When fltx_fsw_MissionMode is set:
(a) If fltx_fsw_DatVal_s0a is clear, neither valid data nor
SOP was detected on the serial lane, and hence
fltx_fsw_CtlDat_s0a and fltx_fsw_NewCtlPkt_s0a must be ignored.
(b) If fltx_fsw_DatVal_s0a is set and
fltx_fsw_NewCtlPkt_s0a is set, it indicates the marker
for a new control packet (fltx_fsw_CtlDat_s0a = 8’h7c);
otherwise fltx_fsw_CtlDat_s0a holds a valid data byte.
2.21.6 The Receiver Handshake Ports


Signal Name 


flrx_fsw_MissionMode | ReceiveLink | FabricSwitch | When clear,
(a) ReceiveLink is down and not available for receiving
data and transmitting flow control packets.
(b) The flrx_fsw_DatVal_s0a signal will remain deasserted.
The rest of the handshake signals,
flrx_fsw_SoP_s0a, flrx_fsw_EoP_s0a, flrx_fsw_Idle_s0a,
and flrx_fsw_InDat_s0a, are invalid and must be ignored.
When set,
(a) If flrx_fsw_DatVal_s0a is clear then the rest of the
handshake signals are undetermined (may be ignored).
(b) If flrx_fsw_DatVal_s0a is set then the rest of the handshake
signals are valid.
(c) Flow control signals fsw_flrx_* are valid signals.


flrx_fsw_DatVal_s0a | ReceiveLink | FabricSwitch | If flrx_fsw_DatVal_s0a is clear, then the rest of the
handshake signals (flrx_fsw_SoP_s0a, flrx_fsw_EoP_s0a,
flrx_fsw_Idle_s0a, flrx_fsw_InDat_s0a) are undetermined
(and hence may be ignored).
If flrx_fsw_DatVal_s0a is set, then the rest of the handshake
signals are valid.
Each of the following five conditions drives flrx_fsw_DatVal_s0a
to logic state 1:
(1) (Byte7 through Byte1 have valid data) & Byte0 has
control character k28.0
(2) (Byte7 through Byte1 have valid data) & Byte0 has
control character k28.3
(3) (Byte7 through Byte1 have valid data) & Byte0 has
control character k28.4
(4) (Byte7 through Byte1 have valid data) & Byte0 has
control character k28.5
(5) (Byte7 through Byte0 have valid data)
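
The five conditions above reduce to one predicate over a decoded FORD. The Python sketch below is only an illustration of that reading: the per-lane tuple representation and the function name are assumptions made here, not part of the ICE9 design.

```python
# Hypothetical model: each lane decodes to ("data", byte), ("k", name) or None.
# The tuple layout and the names below are assumptions made for illustration.
ALLOWED_BYTE0_CONTROLS = {"k28.0", "k28.3", "k28.4", "k28.5"}

def dat_val(lanes):
    """Return True when a received FORD should assert flrx_fsw_DatVal_s0a.

    lanes[0] is Byte0 ... lanes[7] is Byte7.
    """
    if len(lanes) != 8:
        raise ValueError("a FORD has exactly 8 byte lanes")
    # Byte7 through Byte1 must carry valid data in all five conditions.
    if any(lane is None or lane[0] != "data" for lane in lanes[1:]):
        return False
    byte0 = lanes[0]
    if byte0 is None:
        return False
    kind, value = byte0
    # Conditions 1-4: Byte0 is one of the four listed control characters.
    # Condition 5: Byte0 is itself valid data.
    return kind == "data" or (kind == "k" and value in ALLOWED_BYTE0_CONTROLS)
```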


fsw_flrx_CtlDat_s3a FabricSwitch 8-bit databus 


fsw_flrx_NewCtlPkt_s3a | FabricSwitch | TransmitLink | When fsw_flrx_NewCtlPkt_s3a is set, it indicates the
marker for a new control packet and the databus
fsw_flrx_CtlDat_s3a is ignored. The serial lane drives
control character k28.3 when this signal is set.


fsw_flrx_DatVal_s3a | FabricSwitch | TransmitLink | When flrx_fsw_MissionMode is clear:
(a) ReceiveLink is down and not receiving FORDs or
transmitting control packets.
(b) FabricSwitch must not assert the fsw_flrx_DatVal_s3a signal.
If fsw_flrx_DatVal_s3a is asserted, the link
may drive unpredictable characters on the link, causing
unpredictable behavior at the receiver.
When flrx_fsw_MissionMode is set:
(a) If fsw_flrx_DatVal_s3a is clear, TransmitLink
drives either a NULL (k28.5) or an alternate NULL
(k28.0) character on the serial lane.
(b) If fsw_flrx_DatVal_s3a is set, the status encodings
of the rest of the handshake signals are valid.
Chapter 3


The Dense Fabric Switch 


[Last Modified $Id: fabric.lyx 43331 2007-08-15 18:23:44Z wsnyder $] 


3.1 Overview


Each node chip contains a buffered crossbar switch which forms the basic element from which the SiCortex 
communication fabric is built. The switch is designed to provide the necessary components of a degree three Kautz 
network, though it would be well suited to building a 3-dimensional torus, fat tree, butterfly network, or other 
commonly used topology. 


3.1.1 Specifications 


3 input links, 3 output links. Input and output links do not, in general, connect to the same nodes. 


2 GBytes/sec per link. 8 data lanes per link, plus a forwarded clock and a reverse channel which carries flow 
control information. 


Signaling: Max frequency 1 GHz, DC-balanced 8/10 code. 

Best case transit time through idle Fabric Switch: 15 ns. Transit time is measured from when start of packet 
is flopped-out by Link till when flopped-in by Link. Transit time increases for Dma-to-Link or Link-to-Dma 
packets, or if the downstream switch is congested. 

Maximum packet size: ~160 bytes (128 byte payload). 

Virtual Channels: 16. 


Buffers: 16 packets at each crosspoint. 


Ordering: Any packets with the same source node, destination node, and VC, following the same path, must 
remain in order. 


3.2 Differences, Bugs, and Enhancements 
3.2.1 Product and Chip Pass Differences 
1. None. 


3.2.2 Known Bugs and Possible Enhancements


1. The FSW has an architectural performance limit preventing 4 ford packets at max rate, bug1832.
3.3 Description
3.3.1 Routing 


Packets are assigned fixed routes through the fabric by the originating node. The first FORD of each packet 
header contains a string of 2-bit routing codes. (See section 3.4.1.) At each hop, the receiving node examines the 
first routing code of the string, which selects among the three fabric link output ports or an escape. If one of the 
outputs is selected, the string is shifted right by 2 bits (one code) before transmission to the next node. 

Routes are only shifted on packets traveling from IB to OB, not for packets going to or from the DMA. 
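
The per-hop handling of the route string can be sketched as follows. This is a minimal illustration, not the switch logic itself: it assumes the first code sits in the least-significant 2 bits (consistent with the right shift described above) and that code 3 is the escape, which the text does not state. The function names are hypothetical.

```python
ESCAPE = 3  # assumed encoding of the escape; the text does not give the value

def pack_route(codes):
    """Pack a list of 2-bit routing codes, first hop in the low bits (assumed order)."""
    route = 0
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("routing codes are 2 bits wide")
        route |= code << (2 * i)
    return route

def route_hop(route):
    """Return (selected_code, route_to_forward) for one hop."""
    code = route & 0b11
    if code == ESCAPE:
        return code, route       # not sent out a fabric link, so no shift
    return code, route >> 2      # drop the consumed code before transmission
```

For example, a route of output 2, then output 0, then output 1, then the escape is consumed one code per hop until the escape is reached.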


3.3.2 Virtual Channel Assignment 


Each packet is assigned a virtual channel (VC) number when it is created; the VC is chosen according to the 
path the packet is to follow, and determines the set of buffers available to the packet at each switch node along 
its path. The VC is encoded in the packet header. Each switch node has a programmable function which is 
able to conditionally decrement the VC on packets passing through the port. This function is enabled at fabric 
configuration time, and specifies for each port and each virtual channel whether or not to decrement VC. 

Why do we do this? We use VCs to prevent deadlock in the network. Imagine a network of three nodes 
connected in a ring. Node A sends packets to node B, B to node C, and C to node A. Each node can forward a 
packet (pass it through) or consume it. 

Assume that each node has space for exactly one packet in its input buffer. When a packet arrives in the input 
buffer it is examined and either passed along or consumed. A packet is never passed along from one node to the 
next unless the sending node knows there is space available in the receiving node’s input buffer. If node A wants 
to send a packet to node B, then it holds the packet in A’s input buffer until there is free space in B’s input buffer. 
Then the data is sent, and A’s input buffer is made free. 

Now imagine that node A wants to send a packet to node C. It will send the packet first to B when B’s input 
buffer is empty. B will forward the packet to C when C’s input buffer is empty. Further complicate things by 
imagining that at the same time A wants to send packet to C, B wants to send to A, and C wants to send to B. 
In the first “cycle,” B will receive A’s packet, C will receive B’s packet, and A will receive C’s packet. Notice what
happens in the next cycle. A wants to forward the packet it just got from C and put it in B’s input buffer. But 
B’s input buffer is filled — it holds a packet destined for C which is stuck at B because C’s input buffer is filled. 
Nobody moves. We’re stuck in a deadlock. 
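
The stuck state can be reproduced with a toy model. The sketch below assumes the simplified rules of the example (one input buffer per node, forward only when the next buffer is empty); the list representation and function name are illustrative choices, not anything from the ICE9 design.

```python
def step(buffers):
    """Advance a unidirectional ring one cycle; return True if any packet moved.

    buffers[i] holds the destination node of the packet sitting in node i's
    single input buffer, or None when that buffer is empty.
    """
    n = len(buffers)
    moved = False
    snapshot = list(buffers)            # decisions use the start-of-cycle state
    for i, dest in enumerate(snapshot):
        if dest is None:
            continue
        if dest == i:                   # packet has arrived: consume it
            buffers[i] = None
            moved = True
        else:
            nxt = (i + 1) % n
            # Forward only if the downstream buffer was free and still is.
            if snapshot[nxt] is None and buffers[nxt] is None:
                buffers[nxt] = dest
                buffers[i] = None
                moved = True
    return moved
```

With nodes A, B, C numbered 0, 1, 2, the state after the first cycle of the example is `[1, 2, 0]`: every buffer holds a packet for the next node's neighbor, and `step` reports that nothing can move.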

Note that we could add more input buffers at each node, but that would just postpone the problem. If we had 
two input buffers, we could lock up the network by making sure we send two packets from each node to the node 
two hops away. All that it would take is for one node to delay emptying a destination buffer or some small network 
delay and all the buffers would fill. The problem is that there is a resource dependency that wraps around in a 
cycle. A can’t be free until B is free, but B can’t be free until C is free, and C can’t be free until A is free, but A 
can’t be free until... 

This deadlock was a showstopper for many complicated topologies until Dally and Seitz described a scheme
called “virtual channels” in a 1988 paper. In this scheme they proposed adding “extra” buffers at each network 
node, but divided the buffers into classes. That is, buffer number 0 was devoted to virtual channel 0, buffer N to 
virtual channel N. Next they designated a specific virtual channel for every packet. A given packet would travel 
on this virtual channel from its source to its destination. We could apply this scheme to our three node system by 
saying that all messages starting at node C will be sent on virtual channel 1, while all messages from any other 
node will travel on channel 0. This breaks the circular dependency, since though A can’t be free until B is free 
and B can’t be free until C is free, C’s destination (A’s input buffer for channel 1) is never blocked since channel 1 
carries all messages starting at C. (I wish I could do a movie of this one. Ask me to show you on the whiteboard 
if this doesn’t make sense — mhr.) 
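
One way to check the three-node argument mechanically is to build the "buffer X waits for buffer Y" graph and look for a cycle. The sketch below is an illustration under a simplifying assumption: a packet occupies the (node, VC) input buffer at every node after its source. The function names are hypothetical.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph {node: set(successors)}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GREY
        for v in edges.get(u, ()):
            c = color.get(v, WHITE)
            if c == GREY or (c == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(edges))

def buffer_dependencies(routes, vc_of_source):
    """Build wait-for edges between (node, vc) buffers along each route.

    routes: node sequences, source first. vc_of_source maps a source node
    to the virtual channel its packets use for the whole trip.
    """
    edges = {}
    for path in routes:
        vc = vc_of_source(path[0])
        held = [(node, vc) for node in path[1:]]
        for here, there in zip(held, held[1:]):
            edges.setdefault(here, set()).add(there)
    return edges
```

With the routes A to C, B to A, and C to B, a single channel yields a cycle, while putting traffic that starts at C on channel 1 leaves no cycle.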

So the first step in applying the virtual channel idea to our network is to identify all the cycles. It is the cycles 
that we want to break. But we have a network with 972 nodes. How many cycles are there? Probably a bazillion 
or so. Identifying them would be a bit of an issue, so here’s what we do: Number all the nodes from 1 to 972. Each 
node K has three links coming from some other nodes (call them A,B, and C) and three links going to other nodes 
(call them R,S, and T). Each of these nodes has a number. A node is “less than K” if its node number is less than 
K’s node number. Now consider a circular path through a bunch of nodes. Each of those nodes has a number. 
For at least one node P in the cycle P’s upstream node in the cycle (the node that connects to P’s input) and P’s 
downstream node in the cycle (the node that connects to P’s output) are BOTH less than P. (Draw a cycle of five 
or six nodes and number them. Note that you can’t arrange the numbers such that there isn’t some node P that 
fits our description. If you could, then you could build a building such that climbing the staircase would eventually
bring you back to the bottom of the staircase.) 
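
The staircase argument can also be checked by brute force for small cycles. The sketch below simply tries every numbering of a cycle and confirms that some node has both neighbors numbered lower; the function names are illustrative.

```python
from itertools import permutations

def peak_positions(cycle):
    """Indices whose upstream and downstream neighbours in the cycle are both smaller."""
    n = len(cycle)
    return [i for i in range(n)
            if cycle[i - 1] < cycle[i] and cycle[(i + 1) % n] < cycle[i]]

def every_cycle_has_peak(n):
    """Exhaustively check all orderings of n distinct node numbers around a cycle."""
    return all(peak_positions(p) for p in permutations(range(1, n + 1)))
```

The highest-numbered node in any cycle always qualifies, which is why the check cannot fail.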

So, now for every path through a node (from one of its three inputs to one of its three outputs) we can identify 
which paths fit our criteria of the upstream node and downstream node both being “less” than this node. There 
is at least one such path in every cycle within our network. Remember that the idea of virtual channels is to use 
buffer assignments to “break the cycle.” The original VC concept assigned a VC to a packet at the start of its path 
and the VC was constant for the entire tour. We add a twist. We start a packet on some VC X. Each time it passes 
through a node on the route such that the upstream and downstream nodes are both less than the current node, 
we decrement the VC. That breaks the cycle. [1]
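
The decrement rule can be traced along a single route. The following sketch assumes a route is given as the list of node numbers it visits and reports the VC in use on each hop; the function name and the error behavior are illustrative assumptions.

```python
def vc_along_route(path, start_vc):
    """Return the VC carried on each hop of `path` (a list of node numbers).

    The VC drops by one when the packet passes through an intermediate node
    whose upstream and downstream neighbours both have smaller numbers.
    """
    vcs = [start_vc]                        # VC used on the first hop
    for prev, here, nxt in zip(path, path[1:], path[2:]):
        vc = vcs[-1]
        if prev < here and nxt < here:
            vc -= 1
            if vc < 0:
                raise ValueError("start_vc too small for this route")
        vcs.append(vc)
    return vcs
```

For the route 5, 9, 3, 7, 2 starting on VC 2, nodes 9 and 7 are both higher than their neighbors, so the packet travels on VCs 2, 1, 1, 0.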

The last remaining trick is to make the initial VC assignment to each packet such that the VC doesn’t get 
decremented so many times that it falls below 0. (We provide 16 virtual channels in the SiCortex fabric architecture.) 
The likely method we'll use is to count the number of times the VC will be decremented on a particular route, from 
start to finish. It may never be decremented. It can be decremented at most (L-1)/2 times for a
route L hops long. So, for a network with a diameter of 7, we need no more than 3 virtual channels. We provide
more than 3 so that some traffic can travel on channels 0,1,2,3 and other classes of traffic can travel on 4,5,6,7. I’m
not sure why anymore.
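
The decrement bound can be checked by brute force for short routes, and the same count gives a starting VC. The sketch below assumes only intermediate nodes decrement, and the `class_base` parameter (the lowest VC of a traffic class) is a hypothetical name for the 0-3 versus 4-7 split mentioned above.

```python
from itertools import permutations

def decrements(path):
    """Count intermediate nodes numbered higher than both neighbours on the path."""
    return sum(1 for a, b, c in zip(path, path[1:], path[2:]) if a < b > c)

def worst_case_decrements(hops):
    """Largest decrement count over every numbering of a route with `hops` links."""
    nodes = range(hops + 1)                 # a route of `hops` links visits hops+1 nodes
    return max(decrements(p) for p in permutations(nodes))

def initial_vc(path, class_base=0):
    """Lowest starting VC that never drops below `class_base` on this route."""
    return class_base + decrements(path)
```

For routes of 3, 5, and 7 hops the exhaustive search agrees with (L-1)/2.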

The fabric switch can support 16 VCs, but if fewer VCs are needed, the extra buffers can be configured as a 
pool that is available to traffic on any VC. The PoolMask register specifies which buffers are dedicated and which 
are in the common pool. If 6 VCs are needed for a system configuration, set PoolMask to 0xFFC0 and only use
VCs 0-5. The 16-bit value 0xFFC0 indicates that crosspoint buffer entries 0-5 are dedicated to VCs 0-5, and entries
6-15 are pool. 
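
The PoolMask value follows directly from the number of dedicated VCs. The helper below is a sketch that infers the bit polarity from the 0xFFC0 example (a 1 marks a pooled entry); the function names are illustrative.

```python
def pool_mask(dedicated_vcs, entries=16):
    """PoolMask with a 1 for every pooled entry and a 0 for every dedicated entry."""
    if not 0 <= dedicated_vcs <= entries:
        raise ValueError("dedicated_vcs out of range")
    all_entries = (1 << entries) - 1
    return (all_entries << dedicated_vcs) & all_entries

def pooled_entries(mask, entries=16):
    """List the crosspoint buffer entries that the mask places in the common pool."""
    return [i for i in range(entries) if (mask >> i) & 1]
```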


3.3.3 Virtual Channel Arbitration


So now we’ve got every packet assigned to some virtual channel. (And, we’ve noted, the VC may change as the 
packet flows through the network.) To avoid the network deadlock we need to provide a separate buffer on each 
node chip for each virtual channel. In fact, we go one better than this. 

In our simple example of the three node ring, we had a set of VC buffers at each network input port. In the 
ICE9 chip, we have a set of 16 buffers for each (input,output) pair. That is, traffic arriving at a node’s port 0 and
leaving on port 2 will go into a pool of buffers that is separate from traffic for any other pair of input and output
ports. We call the buffer pool that holds traffic from input port X to output port Y a crosspoint buffer.

Each crosspoint buffer has a pool of 16 packet buffer entries. One crosspoint buffer (XB) is associated with each 
input port and output port pair, and the XB keeps track of: 


1. the order in which the packet in each buffer arrived, and 
2. the virtual channel to which that packet is assigned. 


When a packet is in the XB it waits until the XB knows that there is space for it in the downstream node. For 
example, let’s say that a packet arrives on port 0 of node K and is destined to leave on port 2. (We know this 
from looking at the routing instruction.) We also know from the routing instruction that when the packet gets to 
the downstream node D it will leave on, for example, port 1. Then the XB02 (the crosspoint buffer receiving
data from input port 0 and sending it out on port 2) on node K looks at the “buffer busy mask” for XB?1 on the
downstream node. [2] In our example, let’s say that the packet is traveling on VC 3. XB02 asks “Is slot #3 in XB?1
on node D empty?” If so, then XB02 can send the packet on to node D and be assured that there is a place for the
packet to go. In fact, since XB02 knows that the packet will go into slot 3, it sets the XBE_ENTRY field in the
outgoing packet to tell node D to store the packet in slot #3. 

As I noted above, there are 16 slots in the XB packet store. We only use 6 to 8 virtual channels. Slots 0 through 
N in the XB are dedicated to VCs 0 to N. (N is defined in the POOLMASK that is set via the FSW POOLMASK 
configuration register that is reachable from the CSR interface. See section 3.9.) 

At all times, the XB has a conservative estimate of the available buffers in each of four XBs of the downstream
switch (strictly, the available buffers in the XBs for the input port on the downstream node to which this XB’s
output link is connected). If there are free buffers in the pool (buffers not assigned to a specific virtual channel),
the output port selects the oldest packet among its buffers (if the age is known only among packets from the same
input port, the output port should select among occupied input port buffers on a round-robin basis). If there are
no free buffers in the pool, then only those packets for virtual channels known to have available buffers should be
allowed to arbitrate, and the oldest such packet should be chosen. Once a packet is sent from an output port, the
output block tells all XBs connected to it that the assigned buffer is busy.


[1] We’re in the process of patenting the scheme I’ve described here, so please, no matter how dull the conversation might get at your
next party, keep this whole story within the SiCortex community.

[2] I say XB?1 with the ? because we don’t know the input port number that the message arrived on at node D. As it turns out, we
don’t care. K’s XB02 only cares about the four XBs in D that are connected to port 2 on node K.


The local picture of which downstream buffers are busy is maintained in the output block. Buffers get added 
to this list when they are sent downstream. Buffers get freed from this list when a control packet arrives from 
the downstream node with a new “link sequence number.” We don’t use packet-by-packet ACKs to signify correct 
reception, as this would be rather inefficient. Instead, the downstream link continuously sends control packets. One 
field in each control packet carries the sequence number of the last correctly received packet. The upstream node 
then frees up any entries in its local list of busy buffers that were consumed by the acknowledged packets. 


But the local picture is not complete. The downstream node includes a set of “busy masks” in the same packet 
with the last good link sequence number. There are four such masks, one for each of the downstream XB’s connected 
to this link. So, the output block maintains four local busy masks and receives four downstream busy masks. For 
each downstream XB, the OR of the local and the downstream mask yields a conservative picture of which buffer 
entries are free on the downstream node. The local busy mask contains a 1 for each packet in the replay buffer that 
hasn’t been acknowledged. As soon as the packet is acknowledged, the local busy bit is cleared. 
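
The mask combination can be written out in a few lines. The sketch below ORs the local and downstream masks as described; the slot-selection order in `can_send` (dedicated slot first, then any pooled slot) is an assumption made for illustration, since the text does not spell out the preference, and all names are hypothetical.

```python
def free_entries(local_busy, downstream_busy, entries=16):
    """Entries the upstream node may safely assume are free in one downstream XB."""
    busy = (local_busy | downstream_busy) & ((1 << entries) - 1)
    return [i for i in range(entries) if not (busy >> i) & 1]

def can_send(vc, local_busy, downstream_busy, pool_mask):
    """Pick a downstream entry for a packet on `vc`, or None if it must wait.

    Assumed policy: use the slot dedicated to the VC when it is free,
    otherwise the lowest free pooled slot (PoolMask bit set).
    """
    free = free_entries(local_busy, downstream_busy)
    if vc in free and not (pool_mask >> vc) & 1:
        return vc
    for entry in free:
        if (pool_mask >> entry) & 1:
            return entry
    return None
```

Because a bit set in either mask blocks the entry, the estimate can only err on the side of treating a free buffer as busy.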


3.3.4 Flow Control 


At the link level, each transmitter assigns a link sequence number (LSN) to every outgoing packet, and includes 
that number in the header. The receiver includes the most recently received sequence number (of an error-free 
packet) in its buffer status reports flowing up the reverse channel. The transmitter (which receives the buffer 
status reports) retains transmitted packets in a replay buffer, deleting a packet from the replay buffer when the 
downstream node indicates that it has been successfully received. 

Of course, since we’re using the LSN as an acknowledgement mechanism, we have a bit of a startup problem. 
Imagine that the transmitter sends its first packet with an LSN of 0. Now imagine that the first packet is corrupt. 
Here’s the problem: the downstream node probably sent a control packet to the upstream node even before the 
first packet (LSN = 0) arrived. That control packet had to have something in the “last good LSN received” field. 
If it was 0, then we’re already fouled up and we haven’t even sent a whole packet yet. So, we start the transmitter 
at LSN=2 and start the receiver’s last good LSN register at zero. 
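
The start-up issue shows up clearly in a small model of the replay buffer. This is a sketch under stated assumptions: the 4-bit LSN width comes from the header's Lsn<3:0> field and the depth of 16 from the Error Control section, but the class, its method names, and the rule of ignoring an acknowledged LSN that is not outstanding are illustrative guesses, and wrap-around ambiguity is not modeled.

```python
from collections import deque

LSN_BITS = 4                      # header Lsn field is <3:0>
REPLAY_DEPTH = 16                 # replay buffer size given in the text

class Transmitter:
    def __init__(self, first_lsn=2):
        self.next_lsn = first_lsn
        self.replay = deque()     # (lsn, packet) pairs awaiting acknowledgement

    def send(self, packet):
        if len(self.replay) >= REPLAY_DEPTH:
            raise RuntimeError("replay buffer full")
        lsn = self.next_lsn
        self.replay.append((lsn, packet))
        self.next_lsn = (lsn + 1) % (1 << LSN_BITS)
        return lsn

    def on_control_packet(self, last_good_lsn):
        """Drop every packet up to and including the acknowledged LSN."""
        outstanding = [lsn for lsn, _ in self.replay]
        if last_good_lsn not in outstanding:
            return                # stale or start-up value: nothing to free
        while self.replay:
            lsn, _ = self.replay.popleft()
            if lsn == last_good_lsn:
                break
```

Starting the transmitter at 2 means the receiver's initial "last good LSN" of 0 matches nothing outstanding; starting it at 0 would let that initial value falsely acknowledge the first packet.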


3.3.5 Error Control 


First, a point to remember: a single-cabinet system has 972 nodes, each with three links consisting of 10 signal 
pairs, for a system total of 29160 signal pairs. Operating at 2x10^9 bits per second, we have approximately 6x10^13
bits transmitted and received per second, so for any practical bit error rate, our system will encounter signaling 
errors hundreds of times per second. It is therefore essential that we recover quickly and gracefully from the vast 
majority of them. 
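
The arithmetic is easy to reproduce. In the sketch below the node, link, and pair counts and the per-pair bit rate come from the paragraph above; the bit error rate passed in is purely an illustrative assumption, since the text does not quote one.

```python
NODES = 972
LINKS_PER_NODE = 3
PAIRS_PER_LINK = 10
BITS_PER_SECOND_PER_PAIR = 2e9

def signal_pairs():
    """Total signal pairs in a single-cabinet system."""
    return NODES * LINKS_PER_NODE * PAIRS_PER_LINK

def errors_per_second(bit_error_rate):
    """Expected signaling errors per second for an assumed bit error rate."""
    return signal_pairs() * BITS_PER_SECOND_PER_PAIR * bit_error_rate
```

An assumed bit error rate of 1e-11, for instance, gives roughly 580 errors per second across the system.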

Packet error detection and recovery is performed at the link level. Each receiver calculates a packet checksum, 
and verifies it against the checksum provided by the transmitter. The receiver’s status reports back to the trans- 
mitter include the sequence number of the most recently received packet, if there has been no error, or the last 
packet received prior to an error if there has been an error. 


In the event of a detected error, the receiver notifies the transmitter of the error, and the transmitter re-sends 
all the packets following the last one correctly received. The output port contains a replay buffer (16 packets in 
size), and in the event of an error, rather than re-arbitrating for the packets in the crosspoint buffers, packets replay 
from the appropriate point in the replay buffer. This means that crosspoint buffers can be released at the time 
they win arbitration for the output port, and do not have to wait for correct receipt. It results in an estimated 
error recovery time of about 70 ns. 


When operating smoothly, the fabric will achieve cut-through, meaning that the header of a packet will leave 
a node’s output port before any error is detected, so a faulty packet may propagate through the network to its 
destination. To deal with this problem, the type code in the tail of the packet includes a “poison” code which is set
at the node which first detects the error, and causes the packet to be discarded when it arrives at a destination. 
Packets that develop uncorrectable ECC errors while stored in a packet buffer will also be tagged as “poison.” 
[Figure: block diagram. Input blocks IB0, IB1, and IB2 connect to/from the Fabric Link Receivers flr0, flr1, and flr2
(InDat<63:0> from, CtlDat<7:0> to). Crosspoint buffers (XB00, XB02, XB10, XB12, XB20, XB22, XB30, and XB32 are
labeled) feed the outputs: to/from Flt0, Flt1, and Flt2 (Fabric Link Tx; OutDat<63:0> to, CtlDat<7:0> from) and
to/from the DMA Engine (To: OutDat0<71:0>, OutDat1<71:0>, OutDat2<71:0>; From: InDat0<71:0>, InDat1<71:0>,
InDat2<71:0>).]


Figure 3.1: Fabric Switch Block Diagram


3.3.6 Out-of-Band Channel 


Both upstream and downstream channels carry specialized out-of-band fields in the packet which deliver
one byte plus handshake information to the immediate neighbor. The byte is deposited in a software-accessible 
register in the neighboring switch’s control registers, and the handshake bits maintain the full/empty status of the 
corresponding register in the source node. Whenever the handshake bits from the upstream or downstream nodes 
change value, an interrupt may be requested. There will be six such registers in each node, one for each upstream 
and downstream neighbor. This mechanism allows bidirectional out-of-band communication between neighboring 
nodes. It will be used at least for software configuration and management of the fabric, and we can implement 
TCP/IP on it if needed. (See sections 3.5.1, 3.4.1, and 3.4.2.) 


3.4 Operation 


When a packet arrives at the input port of the node (see ?? on page ??) it is re-timed into a sequence of 64 bit 
FORDs delivered at 1/5 the clock rate of the fabric (and is resynchronized/forwarded into the node’s own Switch 
Clock domain.) All recoding and re-timing is handled by the node link receiver. The switch will then route the 
incoming packet to an appropriate output port via one of 15 crosspoint buffers. 

The first FORD in each incoming packet (arriving at an input block or IB) contains the virtual channel number 
for the packet and the number of the port from which it will leave the switch. Ports 0, 1, and 2 lead to the three 
neighbor nodes, while port 3 leads to this node’s DMA engine. The input block may decrement the VC number in 
the header FORD (see 3.6) before sending the packet on to one of four crosspoint buffers. (This is based on the
deadlock avoidance mechanism described in section 3.3.2.) The link recognizes the first FORD in a packet by the presence
of a start-of-packet marker in lane 0 of the link.


When a packet arrives at a crosspoint buffer, it is written into a free crosspoint buffer entry (XBE) at the 
location specified in the first FORD of the packet. (If the location is already occupied, the packet is dropped and 
marked “invalid.”) Arriving packets are immediately bypassed to the output block if the crosspoint buffer is not 
otherwise sending data to the OB. If the output block is not currently moving a packet, it will pass the bypassed 
packet directly to the link output port. Otherwise, the packet will sit in the XBE until it wins a bid for transmission 
and is read from the crosspoint buffer. The last FORD in each packet contains an end-of-packet marker in its 8/10 
encoded form on lane 0. 


The output block is responsible for global arbitration among the packets offered by each of the four attached 
XBEs. We want to make sure that packets can be routed back to back, so that if XB00 is sending a packet through
the output block (OB0), then we can immediately follow the end of its packet with the start of a packet from any
of the XBEs. We do this by starting global arbitration a few cycles before the end of a packet is transmitted, so 
that a winner is ready in time to fill the next output cycle. 


A few elements are not shown in the diagram because they touch nearly everything. The fabric switch contains 
a module FswCsr which connects to the Serial Control Bus (SCB). Control register values are distributed from 
FswCsr to every module, and status values are sent from every module back to FswCsr. The FswCsr can assert 
interrupts to notify processors of error conditions or when an out-of-band character is received.


3.4.1 The Data Link 


The data link is implemented with eight lanes of SERDES, differential, low-swing, channels, each passing bits 
between Ice-9 chips at 10-times Ice-9’s internal clock rate. Inside Ice-9, on the interface between each Link unit 
and the Fabric Switch unit, one 64-bit Ford is passed on each clock. Each external lane handles 8 bits out of those 
64 bits. A series of these Fords represents the flow of Data Packets or Idle Packets. 


On any given interface between the Fabric Switch unit and a Link unit, when the signal indicating “valid data” 
is asserted, Data Packets or Idle Packets are passing. The Fords themselves don’t indicate boundaries between
packets; separate control signals say what each Ford is. The “SOP” signal indicates the Header, the first Ford of a
Data Packet. The “EOP” signal indicates Trailer, the last Ford of a Data Packet. The “Idle” signal indicates this 
Ford is an Idle Packet (Idle Packets are 1 Ford long). Only one of these three signals may be asserted at a time, 
and if none are asserted, this Ford is in the middle of a Data Packet. 
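
The role of each Ford follows from the sideband signals alone, which a short classifier makes explicit. The sketch below is illustrative: the function name and the choice to raise an error when more than one signal is asserted are assumptions, not the switch's defined behavior.

```python
def classify_ford(dat_val, sop, eop, idle):
    """Name the role of one 64-bit Ford from its sideband control signals."""
    if not dat_val:
        return "invalid"                  # nothing valid is passing this cycle
    asserted = sum((bool(sop), bool(eop), bool(idle)))
    if asserted > 1:
        raise ValueError("SOP, EOP and Idle are mutually exclusive")
    if sop:
        return "header"                   # first Ford of a Data Packet
    if eop:
        return "trailer"                  # last Ford of a Data Packet
    if idle:
        return "idle"                     # one-Ford Idle Packet
    return "payload"                      # middle of a Data Packet
```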


The format of a Header Ford, a Trailer Ford, and how they join with payload Fords to form a Data Packet are
shown below. Also shown is the format of an Idle Packet. 


The encodings of Sop, Eop, and EsComma fields are only meaningful in the 10-bit form, while on a differential 
link between Ice-9 chips. Inside Ice-9, between Links and Fabric Switch, we rely on separate SOP, EOP, and Idle 
control signals. A TX Link will ignore whatever value the Fabric Switch put in Ford bits 7:0 during the assertion 
of SOP, EOP, or Idle, and just manufacture the appropriate 10-bit control character to send over the differential 
link. An RX Link will put 8-bit values into Ford bits 7:0 when it receives Sop, Eop, or EsComma 10-bit characters, 
but these 8 bits are ambiguous, being the same as certain other ordinary data bytes. Fabric Switch knows these 
are not ordinary data bytes because of the control signals. 

Although Link units are free to represent EsComma with either NULL or ANULL characters, which according 
to the 8b/10b encoding we use would produce different 8-bit values when decoded, the RX Link always puts the 
8-bit value for NULL into bits 7:0 when decoding either NULL or ANULL. 


Crc32 is created in a manner that doesn’t include Sop, Eop, and EsComma fields, and when judging incoming 
Crc32’s, those fields are again excluded. 


3.4.1.1 Fabric Packet Header Class 
Class
FswPktHdr


Attributes

w0[7:0] | Sop | Start of Packet.
w0[11:8] | Vc | Virtual Channel <3:0>.
w0[15:12] | XbeTarget | XBE Target <3:0>.
(row illegible)
w0[26:22] | NumFords | How many fords in the packet? Valid values are 4 to 20.
(bits and name illegible) | This bit is only used by the DMA engine. If 1, the DMA
treats the second ford as a DMA control ford, otherwise
it is treated as payload.
w0[31:28] | Lsn | Link Sequence Number <3:0>.
w0[63:32] | Route | Route.
w0[63:0] | AllBits | Header doubleword. Overlaps allowed.
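
Extracting header fields is a matter of shifts and masks. The sketch below uses only the four bit ranges stated plainly in the attribute table (Vc, XbeTarget, NumFords, Lsn); the helper names are hypothetical, and rejecting a NumFords outside 4 to 20 is an illustrative choice rather than documented behavior.

```python
FIELDS = {                    # name: (msb, lsb), taken from the attribute table
    "Vc": (11, 8),
    "XbeTarget": (15, 12),
    "NumFords": (26, 22),
    "Lsn": (31, 28),
}

def get_field(w0, name):
    """Extract one field from the 64-bit header doubleword."""
    msb, lsb = FIELDS[name]
    return (w0 >> lsb) & ((1 << (msb - lsb + 1)) - 1)

def set_field(w0, name, value):
    """Return w0 with one field overwritten."""
    msb, lsb = FIELDS[name]
    mask = (1 << (msb - lsb + 1)) - 1
    if not 0 <= value <= mask:
        raise ValueError(f"{name} does not fit in {msb - lsb + 1} bits")
    return (w0 & ~(mask << lsb)) | (value << lsb)

def decode_header(w0):
    """Decode the known fields and check NumFords against its valid range."""
    hdr = {name: get_field(w0, name) for name in FIELDS}
    if not 4 <= hdr["NumFords"] <= 20:
        raise ValueError("NumFords must be between 4 and 20")
    return hdr
```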


3.4.1.2 Fabric Packet Trailer Class 


Class
FswPktTrail


Attributes

w0[7:0] | Eop | End of Packet.
w0[11:8] | Type | Packet type (1111 = FSW_POISON_TYPE).
w0[15:12] | ProcessIndex | Process Index.
w0[31:16] | UnixProcessId | UNIX Process ID.
w0[63:32] | Crc32 | CRC32.
w0[63:0] | AllBits | Trailer doubleword. Overlaps allowed.


3.4.1.3 Fabric Data Packets


0 | Header (as shown above)
1 to (last-1) | Payload, including software header
last | Trailer (as shown above)


3.4.1.4 Fabric Packet Idle Class 


Whenever an output port has no data packets to transmit, it sends an Idle packet, consisting of a single ford
encoded as shown.


Class
FswPktIdle


Attributes

w0[7:0] | EsComma | ES_COMMA (K28.5 NULL or K28.0 ANULL).
w0[15:8] | (name illegible) | Out-of-Band Byte.
w0[16] | EmptyFlag | Empty flag.
w0[17] | TakenFlag | Taken flag.
w0[18] | ErrorAck | Error Acknowledge.
(bits illegible) | Reserved
w0[63:32] | Crc32 | CRC32.
w0[63:0] | AllBits | All bits of Idle Packet. Overlaps allowed.
3.4.2 The Control Link


The control link has one lane, one differential pair between Ice-9 chips, 8-bits wide between the Fabric Switch 
unit and Link units. Control Packets are 15 bytes long. 

Between Ice-9 chips the start of a Control Packet is indicated by the 10-bit SOP character. Between Fabric 
Switch and Link units, the start of a Control Packet is indicated by a NewCtlPkt control signal. This is because 
the 8-bit encoding of the 10-bit SOP character is ambiguous, being the same as another ordinary data byte. A Link 
unit sending a Control Packet will ignore b0. A Link unit receiving a Control Packet will put the 8-bit encoding of
SOP into b0, and assert NewCtlPkt at that time. Note that CSUM intentionally does not cover the SOP field.


3.4.2.1 Fabric Control Packet Class 
Class
FswCtlPkt


Attributes (listed in packet order)

Sop | Start of Packet. During CRC computation, assume SoP=0.
Lsn | Link Sequence Number.
Reserved | Reserved.
ErrFlag | Error flag.
TakenFlag | Taken flag.
EmptyFlag | Empty flag.
P0BusyHi | P0Busy[15:8].
P0BusyLo | P0Busy[7:0].
P1BusyHi | P1Busy[15:8].
P1BusyLo | P1Busy[7:0].
P2BusyHi | P2Busy[15:8].
P2BusyLo | P2Busy[7:0].
P3BusyHi | P3Busy[15:8].
P3BusyLo | P3Busy[7:0].
Oob | Out-of-band byte.
Crc3 | Running CRC of bytes 0-10. During CRC computation, assume Crc3=0.
Crc2 | Running CRC of bytes 0-11. During CRC computation, assume Crc2=0.
Crc1 | Running CRC of bytes 0-12. During CRC computation, assume Crc1=0.
Crc0 | Running CRC of bytes 0-13. During CRC computation, assume Crc0=0.


3.4.3 Control Link Use


It is a good idea, in the case of a critical flow control scheme, to assume that packets will be dropped, corrupted, 
spindled, mutilated, or otherwise ill treated on their way from source to sink. 

As we discussed earlier, flow control is managed by a debit/credit mechanism where the receiving end tells 
the sender how much space is available in the receiver’s buffers for each virtual channel and port. As you might 
imagine, this scheme works fine in the presence of imperfect knowledge on the part of the transmitter, as long as 
the transmitter’s view of the world is always pessimistic: the transmitter must never send a packet for which there 
is no room at the receiver. 

The receiver will provide this imperfect information by keeping up a continual chatter on the control link. Each 
control packet begins with an ES_COMMA tenbit character. Section 3.4.2.1 describes the layout of the control packet. 
Each control packet will contain the serial number of the last packet received without error. We'll call this the 
“link sequence number” or LSN. As each packet arrives intact on the data link, the receiver updates the LSN and 
it is sent back in the next control packet. If an error is detected in an arriving packet, the receiver will not update 
the LSN and will set the Err_Flag entry in the control packet. (The LSN field holds the link sequence number 
from the last successfully received packet.) The error flag remains set in all subsequent control packets until a data 
link message arrives indicating an error recovery retransmission. (This is done via the Error Acknowledge bit in 
the Idle packet.) The other bits sent along with the LSN are used to manage the out-of-band communication channel 
described below. 

In addition to the LSN, the control chatter needs to update the availability of up to 16 virtual channels and 
a shared buffer pool for each of the four outlets at the end of a data link. (See 3.7 for a discussion of buffer 
allocation in the fabric switch.) Each outlet for a switch has 16 buffer slots. The interpretation of a buffer slot 
name vs. the virtual channel to which it belongs is programmable — the meaning is determined by agreement 
between the downstream and upstream nodes on a link. The control link protocol only specifies the means of 
identifying the occupancy for each of the 16 entries in each of the four crosspoint buffers. The control packet 
stream carries a current snapshot of the crosspoint buffer entry utilization for each of the four crosspoint buffers. 
Each XB (crosspoint buffer — see 3.10.7) has 16 entries. The arbitration unit within the switch determines which 
packets may be forwarded to the next node on a path based on the availability of downstream buffer entries. If bit 
N is set in PxBusy, buffer slot number N on crosspoint buffer x is currently filled. 
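
As a minimal sketch of how a sender might read the busy masks, assume each 16-bit PxBusy value travels as the two bytes PxBusyHi and PxBusyLo of the control packet. The function names below are illustrative only and do not come from the design.

```python
SLOTS_PER_XB = 16  # entries in each crosspoint buffer


def busy_mask(hi, lo):
    """Combine the PxBusyHi and PxBusyLo bytes into one 16-bit mask."""
    return ((hi & 0xFF) << 8) | (lo & 0xFF)


def slot_is_free(mask, slot):
    """Bit N set in PxBusy means slot N of crosspoint buffer x is filled."""
    if not 0 <= slot < SLOTS_PER_XB:
        raise ValueError("slot must be in 0..15")
    return (mask >> slot) & 1 == 0


def free_slots(mask):
    """List the slot numbers the downstream node reports as unoccupied."""
    return [n for n in range(SLOTS_PER_XB) if slot_is_free(mask, n)]
```

Which virtual channel a free slot serves is not encoded here; as noted above, that mapping is set by agreement between the two ends of the link.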

Finally, we provide an “out-of-band” communication link between nodes that travels along the control link. The 
out-of-band link is described in section 3.5.1. 


3.4.4 Error Recovery 


Note that we’re not doing error correction. It turns out that error correction on a 10/8 code is rather expensive. 
Parity based schemes (including most SECDED codes) rely on the likelihood of a single bit error being much greater 
than a multi-bit error. Unfortunately, it is unlikely that we could construct a mapping from the tenbit space into 
the eightbit space that preserves the error bit count. That is, for some ten-bit combination, there will be a single 
bit error in the encoded symbol that will result in two or more bits in error for the decoded symbol. Consider a 
symbol with an equal number of 1’s and 0’s. Each of the ten possible single bit errors will create a new symbol with 
either six 1’s and four 0’s or vice-versa. Each of these ten distance one symbols will decode to some legal eightbit 
value. At least one of those must differ in at least two bits from the original (correct) decoded value, because there 
are only eight values that are distance 1 from the original value. So the first alternative is a symbol correcting code 
that could correct one bad symbol out of 255. The cost of the symbol correction hardware really isn’t worth it. 

Simple linear codes are hopeless here. So we’ll use a CRC error detection code. 

Every data packet is protected by a CRC error-detection code, and every output port has a replay buffer in 
which it records the data packets recently sent on that link (Idle packets are not recorded). The connected input 
port records the Link Sequence Number (LSN) of every packet, and as long as the CRC’s are correct, sends the 
LSN’s back to the output port via periodic control packets. In the event of a CRC error, the input port stops 
updating the returned LSN and instead reports an error in the control packet. The output link uses the LSN of 
the last correctly-received packet to look up the position of the erroneous packet in the replay buffer, and re-sends 
the corrupted packet and all its successors. When the downstream node receives the retransmitted packet, if the 
CRC is correct it updates the returned LSN, and the output link resumes taking packets from the switch when it 
has finished retransmission from the replay buffer. 

Control packets and idle packets are never replayed. The switch that generates these packets creates new packets 
with a new CRC constantly. The switch that receives these packets simply ignores packets which have a bad CRC 
and waits to receive the next one. 


3.4.5 Poison 


When the header of a packet arrives at an input port, the switch immediately arbitrates for use of the selected 
output port, and if it’s available, begins outputting the packet. This is called cut-through routing, and is an 
important contributor to the performance of the SiCortex fabric. However, it creates a problem if the packet 
contains errors that aren’t apparent from the first ford; for example, a packet length error cannot be detected until 
we realize that the packet is too long. There is no way to prevent the header from continuing on to its destination 
— or if corrupted, some other destination altogether. The solution to this problem is to “poison” the packet. Any 
node which detects an error will change the Packet Type field in the packet’s last ford to Poison (and increment a 
counter of how often it has poisoned packets). When the packet finally arrives at some destination, the Poison type 
will be recognized and counted, but the packet will be otherwise ignored. Switch nodes which buffer a poisoned 
packet waiting for an output port are permitted to discard the entire packet, provided that it has not begun output. 
For more detail on error recovery, see section 3.8. 


3.4.6 Mission Mode 


The fabric switch depends on the fabric link transmitter (FLT) and receiver (FLR) to send data to its neighboring 
nodes. While the FLT and FLR are being initialized and the link is in training, each link deasserts a signal to the 
FSW called MissionMode. While MissionMode is off, the fabric switch ignores everything else coming from that 
link, to avoid being confused by the training sequences. Once MissionMode is asserted by a FLR, the switch begins 
to accept data packets and send control packets. When MissionMode is asserted by an FLT, the switch waits for 
the first good control packet, then begins sending data packets downstream. (See footnote 3.) 


3.5 Special Communication Paths 


3.5.1 The Out-of-Band Communication Registers 


It is quite handy to have a low bandwidth simple communications path between an upstream and downstream 
node. Normally the network topology would not allow communication from a downstream node B back to its 
upstream node A without requiring a message to pass through multiple hops. 

Half of the out-of-band communication path is implemented on the control link, the other half is in the data link. 
The control link carries OOB information in every control packet; the data link carries OOB information in Idle 
packets, when the link has no data packets to transport. 

Each node has six OOB links: three to its upstream neighbors, and three to its downstream neighbors. Each 
OOB link is bidirectional, with a send and receive register at each end of the link. Let’s assume that Empty is 
high and Taken is low. To use the link, software writes a byte to the send register, and clears the Empty flag. The 
fabric transmits the register and flag to the far end of the link as convenient (in Idle or Control packets), and when 
they are received without error, the far-end fabric switch writes both to the receive register. The receive register 
requests an interrupt when it sees the Empty flag toggle. Interrupt software on the far-end node reads the receive 
register, and sets the Taken flag, which then gets passed over the reverse channel and causes an interrupt on the 
source node. To return to initial conditions, software on the source node sets Empty again, which propagates and 
triggers an interrupt at the receiver. Then software on the receiver clears Taken, which propagates and triggers an 
interrupt at the source. 
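
The four-step handshake above can be sketched as a small software model. This is only an illustration: it collapses the fabric transport into direct method calls, and the class, method, and log names are invented here rather than taken from the register definitions.

```python
class OobLink:
    """One direction of an OOB link, reduced to its software-visible state."""

    def __init__(self):
        self.send_reg = 0
        self.empty = True     # flag owned by the sending node
        self.taken = False    # flag owned by the receiving node
        self.interrupts = []  # record of interrupts, in the order raised

    def send(self, byte):
        """Sender: write a byte and clear Empty (interrupts the receiver)."""
        assert self.empty and not self.taken, "link is not idle"
        self.send_reg = byte & 0xFF
        self.empty = False
        self.interrupts.append("receiver: Empty cleared")

    def receive(self):
        """Receiver: read the byte and set Taken (interrupts the sender)."""
        assert not self.empty and not self.taken
        self.taken = True
        self.interrupts.append("sender: Taken set")
        return self.send_reg

    def sender_release(self):
        """Sender: set Empty again (interrupts the receiver)."""
        assert self.taken and not self.empty
        self.empty = True
        self.interrupts.append("receiver: Empty set")

    def receiver_release(self):
        """Receiver: clear Taken (interrupts the sender); link is idle again."""
        assert self.empty and self.taken
        self.taken = False
        self.interrupts.append("sender: Taken cleared")
```

One byte transfer therefore costs four flag transitions and four interrupts, two on each node.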

The Out-Of-Band communication is driven entirely by software, so other communication protocols may be 
possible as well. 


3.6 Deadlock Avoidance 


The fabric uses a virtual channel scheme to avoid network deadlock. For more information on virtual channels 
and deadlock avoidance, see Section 3.3.2. The fabric switch core makes no changes to virtual channel assignment 
for a packet. It is the responsibility of the input block to decrement the virtual channel assignment per the deadlock 
avoidance scheme. 

For instance, consider a packet arriving at receiver block 0 on virtual channel 3 on a route that dictates a 
decrement of the VC number and will leave the chip on port 1. The upstream node has already verified that 
crosspoint buffer XB01 has room for a packet on VC3. The incoming packet will consume the appropriate slot 
in XB01 and arbitrate for access to a crosspoint buffer entry on the next chip for VC2. The choice of whether 
a packet’s VC is decremented is made in the input block for a port. Each IB has a 3-bit DecrementVc register 
indicating which (if any) packets get a VC decrement based on the output port selected by the routing field at 
the head of the packet. If bit X of the DecrementVc register is set, then VC is decremented for packets whose 
destination is output port X. DecrementVc has only 3 bits because packets going to the DMA (output port 3) are 
never decremented. 
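
The DecrementVc rule can be sketched as follows. The function name and the use of an exception for the VC-0 case are assumptions made for illustration; in hardware that case is flagged as a VcDecrError.

```python
DMA_PORT = 3  # output port 3 is the DMA; its packets are never decremented


def outgoing_vc(vc, out_port, decrement_vc):
    """Return the VC a packet carries after the input block has handled it.

    decrement_vc is the 3-bit DecrementVc register: bit X set means packets
    headed for output port X get their VC decremented.
    """
    if out_port == DMA_PORT:
        return vc
    if (decrement_vc >> out_port) & 1:
        if vc == 0:
            # Hardware flags this condition as VcDecrError.
            raise ValueError("VC decrement requested but VC is already 0")
        return vc - 1
    return vc
```

With DecrementVc = 0b010, the example packet above (VC3, leaving on port 1) continues on VC2, while a packet leaving on port 0 keeps VC3.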


3 The link always delays the assertion of dat Valid two cycles after the assertion of mission mode. The DV infrastructure follows this 
implementation. 

3.7 The Switch Architecture 


3.7.1 General Organization 


When an outbound link becomes available, each crosspoint buffer set (four crosspoint buffers connected to the 
same output port) must pick the best eligible crosspoint buffer entry to send out. The “best eligible” entry is, 
ideally, the oldest. Finding the oldest entry of 16 within a single crosspoint buffer is relatively straightforward and 
inexpensive. We call this stage “Local Arbitration.” Once each of the four crosspoint buffers in a set (e.g. XB00, 
XB01, XB02, and XB03 in Figure 3.1) has chosen a local candidate, the four local candidates bid against each 
other in the “Global Arbitration” phase. 


3.7.2 Ordering Requirements 


The global ordering rule dictates that packets from the same source, going to the same destination along the same 
route, with the same virtual channel must be kept in order. 

Another way to state the same ordering rule is: If there are any differences in the route or VC number, packets are 
allowed to pass each other. Our fabric switch does not bother to compare all the bits of route though; it only looks 
at the least significant four bits, which indicate the destination port in this fabric switch and the destination port in 
the downstream fabric switch. This is an implementation choice; there are other legal choices. Our implementation 
of the fabric switch keeps packets in order only if they 


1. arrive on the same input port 

2. leave on the same output port (route bits 1:0) 

3. are destined for the same output port of the downstream switch one hop away (route bits 3:2) 
4. leave on the same virtual channel 


To maintain this ordering, every crosspoint buffer must keep a record of the relative age of all of its packets. We 
never need to compare packet age between crosspoint buffers, because they are on different routes. 
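
The four conditions above amount to comparing a small key per packet. A sketch, using the route as it arrives at the input port (before the IB shifts away its low 2 bits); the function names are illustrative:

```python
def ordering_key(input_port, route, vc):
    """Key under which this implementation keeps packets in order."""
    out_port = route & 0x3          # route bits 1:0, output port on this switch
    next_port = (route >> 2) & 0x3  # route bits 3:2, output port one hop away
    return (input_port, out_port, next_port, vc)


def must_stay_ordered(a, b):
    """a and b are (input_port, route, vc) tuples for two packets."""
    return ordering_key(*a) == ordering_key(*b)
```

Two packets whose routes differ only above bit 3 compare equal here, so the switch keeps them in order even though the global rule would allow them to pass each other.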


3.7.3 Local Arbitration: Within A Crosspoint Buffer 


When there is an opportunity for a packet to be sent out of an output port, each crosspoint buffer contending 
for that port selects its oldest eligible packet and sends a “bid” to the output port for that packet. Packets are 
eligible if there is a buffer in the downstream fabric switch which can accept the packet. The following paragraphs 
describe how the oldest eligible bidder is determined. 

Each entry in the crosspoint buffer has a 16-bit wide “age vector” associated with it (where “16” is the number 
of entries in a crosspoint buffer). When a new packet arrives with XbeTarget = W, slot W is filled, and its age 
vector is set to all 1s except for bit W. At the same time, bit W in ALL the age vectors within this crosspoint 
buffer is cleared. 

Only “eligible entries” are allowed to bid in a local arbitration cycle. An entry X is eligible if the busy mask 
bits from the output port indicate that there is a buffer entry Y at the destination link that can accommodate the 
packet in entry X. (That is, entry X — carrying a packet for port P and virtual channel V on the next node — is 
eligible only if the next node has space for a packet in XB?P.) 

One cycle after each eligible entry has bid, each entry ANDs its age vector with the vector of bids. If the AND 
of the two is zero, then the corresponding entry wins the local arbitration. Only one such entry can occur for any 
given bid cycle. 
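
The age-vector bookkeeping can be modeled in a few lines. This is a behavioral sketch only: it ignores the one-cycle pipelining, and the class and method names are invented for the example.

```python
SLOTS = 16
ALL_ONES = (1 << SLOTS) - 1


class CrosspointBufferAges:
    """Age vectors for one crosspoint buffer; bit j of age[x] set means
    the packet in slot j is older than the packet in slot x."""

    def __init__(self):
        self.age = [0] * SLOTS
        self.valid = [False] * SLOTS

    def fill(self, w):
        """A new packet arrives with XbeTarget = w."""
        for x in range(SLOTS):
            self.age[x] &= ~(1 << w)        # nobody is younger than the newcomer
        self.age[w] = ALL_ONES & ~(1 << w)  # everything else counts as older
        self.valid[w] = True

    def invalidate(self, w):
        """The packet in slot w was granted and no longer arbitrates."""
        self.valid[w] = False

    def local_winner(self, eligible):
        """Return the oldest eligible slot, or None if nobody can bid."""
        bids = 0
        for x in eligible:
            if self.valid[x]:
                bids |= 1 << x
        for x in range(SLOTS):
            # A bidder wins when no other bidder is marked as older.
            if (bids >> x) & 1 and (self.age[x] & bids) == 0:
                return x
        return None
```

Because arrival order is a strict ordering, exactly one bidder has no older bidder, which is why at most one entry can win in a given bid cycle.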

A crosspoint buffer performs local arbitration in every cycle to select a local winner. If there is a local winner, 
the crosspoint buffer raises a request to its output block. The request consists of the following information about 
the local winner: 


• Which type of request it is. There are two types of requests: 

  - Request for Packet Store. If granted, the crosspoint buffer will start reading its memory and start 
    sending the packet, one ford at a time, to the OB. 

  - Request for Bypass. If granted, the OB will read packet data from its bypass delay pipeline and send it 
    out; the crosspoint buffer doesn’t have to send anything. Bypass is only possible during a window of 3 
    cycles after the start of packet arrives, but during that window the bypass path provides a lower latency 
    path through the switch. 

• Virtual Channel. (Remember that any VC decrement has already been done in the IB.) 

• Which output port will the packet use in the downstream switch? 

• Which crosspoint buffer entry will the packet use in the downstream crosspoint buffer? 

• How many fords in the packet? 


The crosspoint chooses a local winner every cycle based on continually-changing information from several sources. 
The OB receives control packets and forwards the downstream buffer availability to the crosspoint buffer. When 
requests are granted, the winning packet is invalidated so that it doesn’t arbitrate anymore. New packets arrive 
and begin to compete for a chance to be the local winner. Because the inputs are changing every cycle, the local 
winner may change every cycle and this is perfectly legal, but there’s one caveat. When the grant signal comes 
from the output block, it always refers to the local winner that generated a request in the previous cycle. 


The output block will ignore any requests that are made at inconvenient times, such as during the grant cycle, 
during replay, or when the outbound link has gone down. 


3.7.4 Global Arbitration: Between Crosspoint Buffers 


In the previous section we saw that the crosspoint buffers will make requests to the output block in every cycle. 
Each OB sees at most four requests from the four connected crosspoint buffers. Whenever the output port is free, 
or is close to the end of a packet, global arbitration looks at the requests and decides what packet will be sent next. 


Before describing global arbitration, we must consider what packets are really competing for. Before a packet 
can be sent downstream, the switch must be sure that an appropriate buffer is available for it in a downstream 
switch’s crosspoint buffer. So, packets are competing for a spot in a particular crosspoint buffer; the low 
2 bits of a packet’s route tell which crosspoint buffer it needs to go into when it gets there. Also, within a downstream 
XB, some XBEs are dedicated to a VC while others can be used by any VC (see PoolMask register). In global 
arbitration, we try to ensure fair access to the downstream buffers. If requests from different XBs request the same 
NextPort (low 2 bits of route) and VC, they are contending for the same pool of buffers and must be treated fairly. 


Global arbitration is done in two stages. The first stage selects the least recently chosen XB which is requesting. 
The first stage winner’s NextPort and VC are used in the second stage. In the second stage, we only consider 
requests that have the same NextPort and VC as the first stage winner. Often, that narrows it down to just one, 
but there might be up to four requests remaining that all have the same NextPort and VC. The second stage 
does round-robin arbitration between remaining requests, based on just the history of requests with this same 
NextPort/VC combination. The winner of the second stage will be selected to go out the output port as soon as 
possible. The XB that wins is recorded in the stage 1 and stage 2 history so that it influences the next global arb. 


The following diagram describes the state that is stored to implement the two arbitration stages in one output 
block. 


4 When the packet arrived in the IB, the low 2 bits of route told which output block to send it to. After looking at them, the IB 
shifted those 2 bits away. By the time the packet is in the crosspoint buffer, the low 2 bits of route tell which output block it will go 
to in the downstream switch. 

Global Arbitration Example 

Stage 1 Contenders: Three out of four XBs request. The requests are shown, along with their VC and NextPort: 
XB02 (whose VC and NextPort differ from the other two), XB12 (VC=0, NP=3), and XB32 (VC=0, NP=3). 

Stage 1 Arbitration: Find the Least Recently Chosen XB that is requesting, using a 4x4 Age Vector Matrix. The 
table shows that XB12 won least recently, followed by XB22, then XB02. XB32 won most recently. Eliminate the 
rows and columns for the one that is not requesting, and look for a row full of zeroes. XB12 is the winner. 

Stage 1 Result: The stage 1 winner is XB12, VC=0, NP=3. Any requests with the same VC and NextPort will 
proceed to stage 2. 

Stage 2 Contenders: There are two contenders which had the same VC and NextPort as the stage 1 winner: 
XB12 (VC=0, NP=3) and XB32 (VC=0, NP=3). 

Stage 2 Arbitration: Do Round Robin by NextPort+VC, using a 64-entry table of which XB won last (one entry 
for each combination of VC 0 through 15 and NextPort 0 through 3). 

Stage 2 Result: The last time there was a stage 2 arb for VC=0, NP=3, the table says that XB12 won. Give it 
lowest priority this time. XB32 is the winner. 

Last step: Update the stage 1 age vector and stage 2 history table. 


In the first stage, we need to know the least-recently-used crosspoint buffer that is requesting, so we maintain 
four 4-bit age vectors. The NextPort and VC of the first round winner are used to index into the stage 2 table, 
which records the previous winner for each combination of NextPort and VC. The second stage round-robin gives 
the previous winner the lowest priority in winning stage 2 this time. After a winner is chosen, the stage 1 age vector 
is updated, and the winner’s XB number is stored in the appropriate entry of the stage 2 history table. 
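
The two stages can be sketched in software as below. To keep the example short, stage 1 keeps an ordered list (least recently chosen first) in place of the 4x4 age vector matrix; the two are meant to be behaviorally equivalent for this purpose. The class name, the request encoding, and the default used when a history entry has never been written are all assumptions.

```python
NUM_XB = 4


class GlobalArbiter:
    """Behavioral model of two-stage global arbitration in one output block."""

    def __init__(self, lrc_order=(0, 1, 2, 3)):
        self.lrc_order = list(lrc_order)  # least recently chosen XB first
        self.last_winner = {}             # (vc, next_port) -> XB that won last

    def arbitrate(self, requests):
        """requests maps XB index -> (vc, next_port); returns the winning XB."""
        if not requests:
            return None
        # Stage 1: least recently chosen XB among those requesting.
        stage1 = next(xb for xb in self.lrc_order if xb in requests)
        key = requests[stage1]
        # Stage 2: only requests with the same VC and NextPort remain.
        contenders = {xb for xb, k in requests.items() if k == key}
        # Assumed default: with no history, XB 0 gets first look.
        last = self.last_winner.get(key, NUM_XB - 1)
        # Round robin starting just after the previous winner, so the
        # previous winner is examined last.
        order = [(last + i) % NUM_XB for i in range(1, NUM_XB + 1)]
        winner = next(xb for xb in order if xb in contenders)
        # Record the result in both histories.
        self.lrc_order.remove(winner)
        self.lrc_order.append(winner)
        self.last_winner[key] = winner
        return winner
```

Loading the state from the example figure (XB12 least recently chosen, XB32 most recently, and XB12 the previous winner for VC=0, NextPort=3) and presenting the same three requests selects XB32, as in the figure.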

After a winner is chosen, the output port sends a Grant signal to the XB saying that the packet was selected to 
be transmitted. The XB knows that the grant applies to the request from the previous cycle. The XB clears the 
Valid bit on the entry that won, so that a new packet can begin to use that entry. If the request was a “Request 
for Packet Store”, then the XB needs to start shipping data to the OB in the following cycle. 

Back in the output block, the new winner has declared its XbeTarget along with the request, so the OB can 
set a bit in its local pessimistic view of downstream buffer availability. The local pessimistic view is logically ORed 
with the buffer busy mask sent to the crosspoint buffers, so that future requests will assume that the buffer is 
taken. Eventually, the OB receives an acknowledgment in a control packet, the pessimistic bit is cleared, and the 
downstream buffer can be used again. 
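
A minimal sketch of the pessimistic view follows. The text does not say exactly which control packet event clears a pessimistic bit, so the model leaves that decision to the caller through a hypothetical `acknowledge` method; all names are illustrative.

```python
class DownstreamBufferView:
    """The OB's view of one downstream crosspoint buffer's 16 slots."""

    def __init__(self):
        self.reported_busy = 0  # latest busy mask from a control packet
        self.pessimistic = 0    # slots claimed locally, not yet acknowledged

    def busy_mask(self):
        """Mask handed to the crosspoint buffers for eligibility checks."""
        return self.reported_busy | self.pessimistic

    def grant(self, xbe_target):
        """A winner was chosen; assume its target slot is taken from now on."""
        self.pessimistic |= 1 << xbe_target

    def update_reported(self, mask):
        """A control packet delivered a fresh busy mask."""
        self.reported_busy = mask & 0xFFFF

    def acknowledge(self, xbe_target):
        """A control packet acknowledged the packet sent to this slot."""
        self.pessimistic &= ~(1 << xbe_target)
```

The OR keeps the sender's view no more optimistic than the truth during the interval between sending a packet and seeing it reflected in the downstream node's own busy mask.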

When a winner is chosen, the OB knows how long the winning packet is and when it will be done. Therefore, it 
knows when to allow global arb to run again, just in time to select the next winning packet. Meanwhile, all requests 
are simply ignored by the OB. There is no reason that the XB needs to know if global arbitration is running or 
not. The OB only sends the XB a message if it wins. 

When you combine local arbitration and global arbitration, it is easy to introduce the possibility of starvation. 
In several earlier implementations of output arbitration, we discovered cases where a certain traffic pattern in some 
XBs could prevent a packet in another XB from ever getting sent out. One important aspect of the scheme described 
above is that there is separate round-robin history maintained for the specific resources that packets are competing 
for. The NextPort and VC are used in stage 2 because each entry of the table exactly describes the set of buffers 
that a packet needs. Using only the VC in stage 2 arbitration would allow a flood of traffic on (VC0, NextPort 0) 
to starve traffic on (VC0, NextPort 2). Using only the NextPort in stage 2 arbitration would allow a flood of traffic 
on (VC0, NextPort 2) to starve traffic on (VC1, NextPort 2). Another important piece is that every requester must 
receive information on downstream buffer availability in the same cycle, so that if a buffer becomes available that 
allows an old packet to finally go out, it is guaranteed a chance to win local arb and eventually win global arb as 
well. Providing information at the same time is easy for the crosspoint buffers; we need to be especially careful 
in making bypass decisions. If the bypass decision logic learns of available buffers before the competing crosspoint 
buffers, bypass packets could starve normal traffic and break ordering and fairness rules. 


3.7.5 Why Two Levels of Global Arbitration? 


Matt wrote this section to describe some of the pitfalls of global arbitration schemes that didn’t take into 
account NextPort and VC. Bryce left it in the spec because it describes one of the most important problems we’re 
trying to avoid. 


A single-stage “least recently chosen” scheme is fair, but not immune to livelock. Imagine that there is a packet 
X in XB00 that needs VC1 on port 3 in the next chip. At the same time all four XBs (XB00, XB01, XB02, XB03) 
are extremely busy and always have packets that are eligible to bid even when packet X can’t. Now imagine that 
every time XB00 wins global arbitration, VC1 is busy, but XB00 has traffic for some other VC. This will ensure 
that every time VC1 becomes available to XB00, it is the least likely bidder to be chosen. (It wasted its turn on the 
traffic for the other virtual channel: the global winner is always XB01, XB02, or XB03 when there is space available 
in VC1/P3 of the next chip.) The packet in XB00 for VC1 will never win the global bidding: it is stuck. Unlikely? 
Yes. Impossible? No. In fact, this bug has surfaced before. 

The problem is that we’re doing a two level arbitration where success for an individual requester requires success 
at both levels simultaneously. In the case of packet X, it won its own local bid whenever it was eligible (because it 
was eventually the oldest packet entry in XB00) but each time it got to bid on a global resource, other traffic in 
XB00 had caused the least-recently-chosen token to pass it by. 


3.7.6 Stitching it all Together 


It is important that we be able to string packets back-to-back through an output port. This means that as the 
last bits of a packet are being sent to an output port driver, we need to have the first bits of the next packet queued 
up and ready to go. To accomplish this, we have tuned the global arbitration logic so that it chooses a new winner 
several cycles before the data is needed at the output mux. By arbitrating several cycles before the end of packet is 
transmitted, we cover the delay of arbitration, notifying the winning XB, and starting to read the winning packet. 
This implementation does not require skid buffers. 


3.8 Error Detection and Recovery 


There are several places where bits could get flipped, slipped, spindled, or mutilated. As was indicated above, 
we attempt to isolate errors to the link level and retry in the presence of bit errors on the link. Some errors are 
recoverable, in the sense that we can retry the transmission and will get the bits across the link on the second or 
third try. Other errors may not be correctable in this way. In this latter case, we will “poison” the outgoing packet 
so that it will eventually be dropped into the bit-bucket somewhere along its future path. 

It should be noted that some errors may be detected well after the packet has begun its trip to the next node 
on its path. That is, the head of a packet may have left a node before the tail has been seen and an error has been 
detected. This is a problem, and probably the only really tight path in the switch. The IB must detect an error in 
the last FORD (possibly a CRC error) and propagate a signal down to the OB in time to cause the OB to change 
the packet type in the last FORD of the packet to “POISON.” This is serialized with the creation of the CRC field 
that is connected to the same packet. Note that we must generate good CRC for ALL transmitted packets, whether 
poisoned or not, otherwise we'll trigger a retry on the link, since all CRC mismatches are assumed to be caused by 
link transmission errors. 

Nearly all error detection occurs in the input block, so that crosspoint buffers and output blocks do not have 
to worry about error conditions. The input block catches protocol errors such as missing or extra SoP or EoP 
markers, packets that are too long or too short to be legal, and CRC errors. Corrupted XbeTarget is detected in 
the crosspoint buffer so that we avoid overwriting a good packet with a corrupted one. ECC errors are detected as a 
packet store or replay buffer entry is read. 

3.8.1 CRC Generation and Checking 


All packets (control, data, and idle) are covered by a 32 bit CRC. The algorithm is defined in Crc.sp. (This is 
the CRC-32 scheme.) The initial value for the CRC sum, before the first byte or word is cranked in, is 0xFFFFFFFF. 
The final value is NOT complemented before being written into the packet. All bits in the packet are covered by 
the CRC except for the SoP field from the link and the CRC value itself. (In the case of data packets, the top 32 
bits of the last FORD are the CRC field, the low 32 bits are covered by the CRC.) 
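
As an illustration of "CRC-32 with an all-ones initial value and no final complement", here is a bitwise sketch. It assumes the common reflected form of the CRC-32 polynomial (0xEDB88320) and least-significant-bit-first processing; the authoritative bit ordering and polynomial are whatever Crc.sp defines, so treat this as an example of the idea rather than the link's exact algorithm.

```python
import zlib

CRC32_POLY_REFLECTED = 0xEDB88320  # assumed: the usual CRC-32 polynomial


def crc32_no_final_complement(data):
    """CRC-32 seeded with all ones, returned without the final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32_POLY_REFLECTED
            else:
                crc >>= 1
    return crc  # standard CRC-32 would return crc ^ 0xFFFFFFFF here


def matches_zlib(data):
    """Cross-check: zlib's CRC-32 is the same sum with the final inversion."""
    return crc32_no_final_complement(data) == (zlib.crc32(data) ^ 0xFFFFFFFF)
```

Under these assumptions the result differs from the familiar CRC-32 value only by a final bitwise inversion, which makes it easy to check against an existing library.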

Many of the errors below are a subset of a CRC error. But some fields are more likely to confuse the switch 
than others. Bit errors in the payload are easy to handle. For the fields that the fabric switch really cares about, 
such as the VC, XbeTarget, and route, the recovery mechanism (if needed) is described in a separate section below. 


3.8.2 Handling Poisoned Packets 


Here is the problem: 

Imagine a packet that is corrupted such that, while it had been traveling on VC 3, the VC got changed to VC 4. 
This is definitely a bad thing, since our deadlock avoidance mechanism depends on VCs monotonically decreasing. 
Any error that causes a VC to decrease in level is tolerable. Errors that bump the VC up can cause a deadlock. 

But the deadlock is only an issue for packets that make it into the packet store. If a packet is bypassed from 
one node to another, then there was no buffer contention to cause a deadlock. Packets that are stored however, 
have the opportunity to negotiate their way into a deadly embrace. 

For this reason, when we write a packet into a packet store, we examine the packet type. If the CRC is good 
and the packet type is poisoned, we immediately free the packet from the buffer. 

This will not prevent a poisoned packet from traveling through the fabric, but it will prevent such a packet from 
locking up a packet store slot, which is the source of our potential deadlock. 


3.8.3 Transient Bit Errors on the Link 


Packets that are corrupted while traveling over the internode link will be resent by the upstream node. This is 
how. 

Consider two nodes at either end of the link. U is the upstream node, transmitting data to D, the downstream 
node. Each packet sent by U carries a serial number (the LSN, or Link Sequence Number) that increments with 
each newly transmitted packet and is 4 bits wide. As U’s output block for this link (OB) sends each packet, it will 
write the packet to a replay buffer. The replay buffer is indexed by LSN. 

D, the downstream node, checks each packet as it arrives for errors. If a packet arrives without error, D loads 
the packet’s LSN into the “Last Good Sequence Number” register in the link’s input block (IB). At some time in 
the very near future, the current value of the Last Good Sequence Number will be sent back up the control link 
from D to U in a control packet. 

When each control packet arrives at U, OB will examine two fields. If the Error bit in the first byte of the 
packet is clear, then the LGSN from the downstream node will be sent to the replay buffer. The replay buffer will 
release all packets up to and including the LGSN, as they have been acknowledged by the downstream node. 

If D detects a CRC error, the IB will enter the Error Detected state and will ignore all incoming packets while 
in this state. The IB will set the “Error” bit in all outgoing control packets sent to the upstream node. 

U will eventually receive a control packet (whose CRC checks out) that has the Error bit set. This tells U that 
all packets after the LGSN in that packet were ignored and must be resent. Before beginning the retransmission, 
U will send IDLE packets with a bit set in the IDLE FORD indicating that the Error is being acknowledged. The 
OB will then wait until it sees a control packet from D that has a clear Error bit. 

The IB on node D will see at least one IDLE FORD with the error acknowledge bit set. IB will leave the “Error 
Detected” state, clear the Error bit in the first byte of outgoing control packets, and await resumption of the packet 
stream from U. 

U will then receive a control packet that has the Error bit clear. This is the completion of the link error 
handshake. The OB on node U will begin sending packets out of its replay buffer beginning with the LSN after the 
LGSN that arrived in the most recent control packet. Once it has resent all the packets in its replay buffer, it will 
resume normal operation. 
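
The downstream half of this handshake can be modeled as a two-state machine. The sketch below is behavioral only; the class and function names are invented, and the initial LGSN of 15 is an assumption chosen so that the first packet carries LSN 0.

```python
LSN_MOD = 16  # the LSN is 4 bits wide


class DownstreamIB:
    """Input block on node D: tracks the LGSN and the Error Detected state."""

    def __init__(self, initial_lgsn=15):
        self.error_detected = False
        self.lgsn = initial_lgsn
        self.accepted = []  # LSNs of packets passed on to the switch

    def on_data(self, lsn, crc_ok):
        if self.error_detected:
            return  # ignore all incoming packets while in Error Detected
        if not crc_ok:
            self.error_detected = True
            return
        self.lgsn = lsn
        self.accepted.append(lsn)

    def on_idle(self, error_ack):
        if self.error_detected and error_ack:
            self.error_detected = False

    def control_packet(self):
        """(Error bit, LGSN) as reported to the upstream node."""
        return (self.error_detected, self.lgsn)


def lsns_to_resend(lgsn, next_lsn):
    """LSNs node U replays: everything after the LGSN, up to what it had sent."""
    out = []
    lsn = (lgsn + 1) % LSN_MOD
    while lsn != next_lsn % LSN_MOD:
        out.append(lsn)
        lsn = (lsn + 1) % LSN_MOD
    return out
```

In a run where packet 1 is corrupted and packet 2 arrives behind it, D reports (Error, LGSN=0), U acknowledges with an Idle packet, D clears the Error bit, and U replays LSNs 1 and 2.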

Note that packets are only freed from the replay buffer after U has received some positive acknowledgment from 
D via a control packet. The LGSN field tells U that all packets up to and including LGSN have been received 
correctly. The replay buffer can hold 16 packets, but in fact the OB stops transmitting if it contains 15 packets. 
The entry corresponding to the LGSN that arrived in the most recent control packet must not be used, or the 
acknowledgment protocol becomes ambiguous. Example: If LSN3 was received last, and an OB sends data packets 
4 through 15 and then 0 through 3, it can’t tell if the next control packet acknowledging LSN3 has acknowledged 0 
entries or all 16. To avoid this confusion, the OB would only send data packets 4 through 15 and 0 through 2 and 
then wait, avoiding LSN3 because it equals the LGSN. 
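
The 15-of-16 rule can be checked with a short model of the sender's window. This is a sketch, not the OB implementation: `ReplayWindow` and its methods are hypothetical names, and the model counts packets rather than storing them.

```python
LSN_MODULUS = 16       # the replay buffer has 16 slots, addressed by LSN
MAX_OUTSTANDING = 15   # the OB stops at 15 so the LGSN slot is never reused


class ReplayWindow:
    """Hypothetical model of the OB's count of unacknowledged packets."""

    def __init__(self, lgsn):
        self.lgsn = lgsn        # LGSN from the most recent control packet
        self.outstanding = 0    # packets sent but not yet acknowledged

    def can_send(self):
        return self.outstanding < MAX_OUTSTANDING

    def send(self):
        """Record one transmitted packet and return the LSN it carries."""
        if not self.can_send():
            raise RuntimeError("replay buffer full: arbitration is inhibited")
        self.outstanding += 1
        return (self.lgsn + self.outstanding) % LSN_MODULUS

    def acknowledge(self, new_lgsn):
        """Free every packet up to and including new_lgsn; return the count."""
        freed = (new_lgsn - self.lgsn) % LSN_MODULUS
        if freed > self.outstanding:
            raise ValueError("control packet acknowledges unsent packets")
        self.outstanding -= freed
        self.lgsn = new_lgsn
        return freed
```

Because at most 15 packets are ever outstanding, the modular difference between the new and old LGSN is always an unambiguous count between 0 and 15. A repeated LGSN therefore always means "nothing new acknowledged".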

When the replay buffer is filled, the OB inhibits global arbitration so that no more packets are sent out. 
Normally, the replay buffer should never fill up, as the round-trip latency from U to D and back again to U is short 
enough that slots in the OB will be freed up more quickly than they are consumed as long as there are no errors 
on the control link.[5] 

If the retransmission fails, we’ll keep attempting to retry. Retry events are counted and can cause an interrupt 
when the count exceeds a preset threshold. The link logic also maintains counts of framing errors and symbol 
translation errors. These are handled in the link control logic. 


3.8.4 Corrupted VC 


A corrupted VC has one of two origins. Either the VC was corrupted as part of a CRC error that we'll find when 
the EoP comes along, or it was corrupted at some previous stage and now carries a good CRC, because the CRC was 
regenerated for a poisoned packet produced by the recovery mentioned above. (All packets get a good CRC when 
they’re sent out, even if they’ve been poisoned.) 

Sooner or later, some bit error will corrupt a VC. There are two ways we can find that the VC has been 
corrupted. 


1. We are supposed to decrement the VC and it is already 0. This is flagged as a VcDecrError and the packet 
is dropped by the input block. It would be dangerous to allow the packet to continue on to other nodes, 
because it has broken the VC decrement rule that allows the fabric to be deadlock-free. In this case the CRC 
is good, and we must NOT cause replay because the replayed packet would have the same problem. 


2. The buffer index points to a buffer belonging to VC x (as opposed to the free pool) and the packet is traveling 
on VC y. In this case, we know the VC is broken and the CRC will not match (the VC got corrupted on the 
wires). We then use the normal CRC mismatch recovery mechanism to ask for a retransmission: the IB will 
find the CRC mismatch, and the packet store should free the buffer. 


3.8.5 Corrupted Route 


If the route was corrupted on the link, the corruption will cause a CRC mismatch, in which case a poisoned 
packet will be delivered to somebody — probably not the intended recipient. In this case the link will retry the 
transmission and a good — non-poisoned — packet will be resent. The retry packet will get to the ultimate destination. 
The poisoned packet will wander around for a while and either get delivered to some destination — where it will be 
discarded as a poisoned packet — or it will arrive at a node where the VC will be decremented from 0. In this latter 
case, the packet will be routed to the DMA engine as described in 3.8.4. 

As in the case of a corrupted VC, the route could instead have been corrupted at an earlier stage or as the result 
of a flipped bit in the switch (e.g. an error in the packet store). The packet carrying the corrupted route will then 
be poisoned, and it will wander around the network until it is delivered to some node (probably not the intended 
recipient) or is dropped because of an exhausted VC. 


3.8.6 Corrupted Buffer Index 


The packet store (within an XB block) may find that the buffer index of a packet points to a packet buffer entry 
that is already full. The packet store will ignore the packet — that is, it will not write the packet into a packet 
buffer entry. If the buffer index is corrupted such that it places the packet in an unused buffer, the buffer slot will 
still be freed, as the CRC will not match. Packets that arrive with a bad CRC will never occupy packet store space; 
at most, they will be deleted from the packet store as soon as the IB tells the XB that the CRC was bad. The 
IB will ask for a retransmission, since the CRC will not match. 


[5] What is the worst case delay of acknowledgment, assuming no errors in control packets? A packet P1 is sent downstream that is 
20 FORDs long. In the worst case, a control packet begins just as that packet is completing, so it is not acknowledged until the second 
control packet. Two control packets take 30 cycles. Add 3 cycles in each direction for latency of the link. The worst case delay is 
around 20+30+3+3=56 cycles, which is enough time for 14 minimum sized packets to be sent. So even in the worst case, the replay 
buffer should not fill up unless there are bit errors on the link. 
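
The footnote's arithmetic can be written out as a short calculation. The constant names are illustrative only, and the one-byte-per-cycle control link rate is inferred from the statement that two 15-byte control packets take 30 cycles.

```python
MAX_PACKET_FORDS = 20        # longest data packet, one FORD per cycle
MIN_PACKET_FORDS = 4         # shortest data packet
CONTROL_PACKET_CYCLES = 15   # 15-byte control packet, assumed one byte per cycle
LINK_LATENCY_CYCLES = 3      # link latency in each direction


def worst_case_ack_delay():
    """Cycles from the start of P1 until U sees its acknowledgment."""
    # P1 just misses one control packet, so it is covered by the second one.
    return (MAX_PACKET_FORDS
            + 2 * CONTROL_PACKET_CYCLES
            + 2 * LINK_LATENCY_CYCLES)


def packets_sent_while_waiting():
    """Minimum sized packets that fit in the worst-case delay."""
    return worst_case_ack_delay() // MIN_PACKET_FORDS
```

The result, 14 packets, stays below the 15-packet limit at which the OB stops transmitting.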


May 14, 2014 130 Rev 51328 


SiCortex Confidential 3.8. ERROR DETECTION AND RECOVERY 


3.8.7 Corrupted LSN 


If the LSN field is corrupted in transit, the input block will discover that the CRC is bad. It doesn’t trust the 
LSN field until the CRC is checked, so no special recovery mechanism is needed. This type of corruption will just 
cause the FswPktCrcError counter to increment. 


3.8.8 Misc. Bad Data (CRC Mismatch) 


In this case, the IB will detect a bad CRC on the incoming packet. It will change the packet type to Poison as 
it forwards the data to the XB and the OB. Also the IB asserts a BadPacket signal to the XB so that the packet 
can be discarded from a crosspoint buffer. The IB goes into replay so that the packet will be retransmitted. 


3.8.9 Uncorrectable ECC Error in Packet Store or Replay Buffer 


This is bad, since we’ll end up with a non-delivered packet. When an uncorrectable ECC error is detected, the 
memory module asserts a double bit error flag which tells the OBX output mux to poison the packet. Also a CSR 
bit is set which, if enabled by software, will trigger an interrupt. 


3.8.10 Uncorrectable ECC Error on Data to DMA Engine 


When a crosspoint buffer detects an uncorrectable ECC error, it asserts a double bit error flag which tells the 
DMA output block to poison the packet. Also a CSR bit is set which, if enabled by software, will trigger an 
interrupt. 


3.8.11 Uncorrectable ECC Error on Data from DMA Engine 


The ICE9 memory system uses ECC to protect data from the moment it is written to an L2 segment until it 
reaches the fabric switch input block. For typical packet data, the processor generates the data in its L1 and asks 
the DMA to send it to a remote node. As that packet is sent onto the CSW to the DMA, an ECC code is generated 
that moves with the data as it goes through the CSW, DMA packet buffers, and into the fabric switch. In the 
fabric switch, the DMA input block corrects ECC errors before sending the data to a crosspoint buffer or to the 
output block for bypass. If there is a double bit error, it poisons the packet and asserts BadPacket at EoP time. 
(This is the same as what a normal input block does when it discovers any other kind of error.) 


3.8.12 Upstream Link Goes Down 


The fabric switch monitors the MissionMode signals coming from the 3 fabric link receivers to see if any upstream 
link has gone down. If an upstream link goes down, the fabric switch will treat any packets that are currently 
being received as error packets and enter replay. It will stop sending control packets upstream while MissionMode 
is down, and resume sending them after MissionMode goes back up. 

A processor on the node can learn that the upstream link is down by an interrupt from the fabric link, or by 
polling the link CSRs. 


3.8.13 Downstream Link Goes Down 


The fabric switch monitors the MissionMode signals coming from the 3 fabric link transmitters to see if a 
downstream link has gone down. If a downstream link goes down, the fabric switch will stop sending any new 
packets to the corresponding FLT. Packets that have been sent out already, or are currently being sent out, will 
remain in the replay buffer. Any control packets coming from the downstream fabric switch will be ignored while 
MissionMode is deasserted. Eventually, the link will go back up, MissionMode will be asserted again, and the 
fabric switch will resume its usual output behavior: sending packets downstream as long as buffers are available 
and accepting good control packets. There is no mechanism for a processor to extract packets from switch buffers 
or to drop packets destined for a bad link. A processor on the node can learn that the downstream link is down by 
an interrupt from the fabric link, or by polling the link CSRs. 

What happens to the system as a whole if a downstream link goes down forever? The fabric switch detects 
the loss of MissionMode from the FLT, and stops sending new packets on that link. Packets accumulate in the 
4 crosspoint buffers that feed that output port, and eventually the buffers fill up. Control packets carry that 
information to the upstream switch, so its crosspoint buffers start to fill. In the upstream switch, the packets 
intended for the downed link fill up its four crosspoint buffers, and other traffic (not routed through the downed 
link) gets stuck waiting for available buffers. The congestion propagates back through the fabric and eventually 
DMA engines stall because they can’t put any more packets into the fabric. The fabric grinds to a halt. 

On the positive side, all of this will resolve itself quickly after the affected link comes back online. But if the link 
is determined to be down permanently, what can we do? I will describe a way to recover, in part to demonstrate 
why we have decided not to attempt it in this version of the chip. Let’s say that FLT2 on node X reports that the 
link is down, and software determines that the link is down for so long that it will never come back up. The fabric 
switch crosspoint buffers and replay buffers are full of packets destined for that link. First the processor would 
send a LinkDown message to every node, including itself, saying that it must recompute its routing tables to 
avoid the affected link. Sending packets across the fabric as usual would not work, because parts of the fabric may 
be stuck by this time. Using the Out-of-Band channels or the system service processor would be possible. When a 
node Y receives the LinkDown message, it must send one LastPacketOnRoute packet along each route that touches 
the affected link, recompute its routing tables to avoid the downed link, then suspend sending any traffic along 
the affected routes until it gets a LastPacketOnRouteAck. Upon receiving LastPacketOnRouteAck, it can resume 
normal traffic along the new route. This handshake guarantees that all packets on the old route are delivered before 
any packets on the new route. After sending the LinkDown message, node X can start rerouting packets; it pulls 
them out of the replay buffer in order, generates a new header that routes the packet to the intended destination 
through a working link, and injects it into the fabric via its DMA. (NOTE: The FSW would need a mode that 
allows packets to flow into the replay buffer even though MissionMode is down.) Node X may have to continue 
this for a very long time, until it drains the fabric and every DMA engine’s queues of any packets that required 
this link. Eventually it starts to see LastPacketOnRoute messages, and sends LastPacketOnRouteAck messages 
back to the sender so that they can resume sending traffic normally on the new route. It may be possible to know 
exactly how many LastPacketOnRoute messages to expect so that node X knows when to stop rerouting packets, 
but it’s probably easier to just do it indefinitely. A maskable interrupt that notifies the processor when the replay 
buffer is nonempty might be useful here. 

Having said all of that, based on the complexity of recovering from link failures without dropping or reordering 
any packets, and the hardware, software, and verification work involved in making this possible, we have decided 
NOT to support this. For this version of the chip, our strategy is to hope that the link comes up again, and if it 
doesn’t? Packets were lost in the switch, so do a machine check. 


3.9 The Control/Status Register Path 


3.10 Components and Hierarchy 


3.10.1 Switch Top level 
3.10.1.1 External Ports 


Inputs 


chaini_scbs_dat_sr Input chain for Serial Configuration Bus (SCB). All CSRs are accessed through the SCB. 

flrX_fsw_InDat_s0a<63:0> Input data from port X, where X is 0, 1, or 2. 

flrX_fsw_DatVal_s0a True if InDat is carrying valid data. 

flrX_fsw_SoP_s0a True if InDat is carrying the first FORD in a packet (Start-of-Packet). 

flrX_fsw_EoP_s0a True if InDat is carrying the last FORD in a packet (End-of-Packet). 

flrX_fsw_Idle_s0a True if this is an inter-packet IDLE FORD, which carries out-of-band and error control 
information. 

flrX_fsw_MissionMode When clear, the fabric switch must ignore the SoP, EoP, Idle, DatVal, and InDat 
signals coming from Fabric Link Receiver X. When set, the signals from Fabric Link Receiver X are 
valid. 

fltX_fsw_CtlDat_s0a<7:0> Flow control, error notification, and out-of-band information from port X’s 
downstream node. 

fltX_fsw_NewCtlPkt_s0a CtlDat should be ignored; the next cycle’s value will be the first byte in a flow 
control packet coming from transmit port X’s downstream node. 

fltX_fsw_CtlEoP_s0a The byte carried by CtlDat is the last in this control packet. 

fltX_fsw_MissionMode_s0a When clear, the fabric switch must not assert fsw_fltX_DatVal_s2a, and it 
must ignore any control packet traffic from Fabric Link Transmitter X. After MissionMode goes up, the 
fabric switch must not send any data packet until after a good control packet has been received. 

dma_fsw_InDatX_s0a<71:0> Data from the DMA engine destined for output port X. Bits 63:0 are the 
data, and bits 71:64 are the ECC code computed over those 64 data bits. 

dma_fsw_DatValX_s0a Corresponding InDatX is valid. 

dma_fsw_SoPX_s0a Corresponding InDatX is the first FORD in a packet. 

dma_fsw_EoPX_s0a Corresponding InDatX is the last FORD in a packet. 

dma_fsw_RdyX_s1a Port X in the DMA engine is ready for a new packet from switch input port X. 


Outputs 


scbs_chaino_dat_sr Output chain for Serial Configuration Bus (SCB). All CSRs are accessed through the 
SCB. 

fsw_xxx_Int_sa Active-high interrupt, asserted when any bit in the Interrupt Cause Register that is not 
masked by the Interrupt Mask register is set. The processor must determine the exact interrupt cause 
by reading CSRs. 

fsw_fltX_OutDat_s2a<63:0> Output data to the fabric link transmitter for port X. 

fsw_fltX_DatVal_s2a Corresponding OutDat is worth looking at. 

fsw_fltX_SoP_s2a Corresponding OutDat is the first FORD in a packet. 

fsw_fltX_EoP_s2a Corresponding OutDat is the last FORD in a packet. 

fsw_fltX_Idle_s2a Corresponding OutDat carries out-of-band and error control information. 

fsw_flrX_CtlDat_s3a<7:0> Flow control data for the upstream control link from receive port X. 

fsw_flrX_NewCtlPkt_s3a Corresponding CtlDat should be ignored; the next value is the first data in a control 
packet. 

fsw_flrX_CtlEoP_s3a Corresponding CtlDat should be ignored; this is the last byte in a control packet. 

fsw_dma_OutDatX_s2a<71:0> Output data from switch input port X to the receive port buffer X in the 
DMA engine. Bits 63:0 are the data, and bits 71:64 are the ECC code computed over those 64 data bits. 
The ECC protects against single bit errors in DMA memories, DDR, and the L2 cache. 

fsw_dma_DatValX_s2a True if corresponding OutDat is worth looking at. 

fsw_dma_SoPX_s2a Corresponding OutDat is the first FORD in a packet (you’ve probably noticed a pattern by now). 

fsw_dma_EoPX_s2a Corresponding OutDat is the last FORD in a packet. 

fsw_dma_BufAvailX_s3a If true, the DMA engine may send a transmit packet from DMA engine transmit 
buffer X to switch port X. 


3.10.1.2 Serial Configuration Bus Interface 


The fabric switch’s control/status registers are accessible through the SCB (Serial Configuration Bus) interface. 
To connect to the SCB, a module must simply instantiate an SCB slave module, and connect it to a global SCB 
chain. The input is connected to chaini_scbs_dat_sr and the output is connected to scbs_chaino_dat_sr. 

The SCB bus and the SCB slave module are documented in Chapter 10 (the Serial Configuration Bus chapter). 

The FSW’s control/status registers are documented in section 3.12.5. 


3.10.1.3 Interrupt Outputs 


The fabric switch produces an interrupt signal when certain kinds of errors are detected or when out-of-band 
flags toggle. The interrupts are sent to the CSW, which distributes them to processors appropriately. The interrupt 
outputs are level sensitive, active-high signals. Interrupts turn on when the condition is first detected, and remain 
on until cleared via the SCB. 

Figure 3.2: Data Path from DMA Engine Transmit Port 0 to Fabric Switch 


3.10.1.4 The DMA to Fabric Switch Interface 
Transmit Data Path 


dma_fsw_InDatX_s0a<71:0> Data+ECC from the DMA engine destined for output port X. 

dma_fsw_DatValX_s0a Corresponding InDatX is valid. Asserted during every data packet cycle, including the 
SoP and EoP cycles, and at no other time. 

dma_fsw_SoPX_s0a Corresponding InDatX is the first FORD in a packet. 

dma_fsw_EoPX_s0a Corresponding InDatX is the last FORD in a packet. 

fsw_dma_BufAvailX_s3a If true, the DMA engine may send a transmit packet from DMA engine transmit buffer 
X to switch port X. 


The DMA input block (DMAI) asserts fsw_dma_BufAvailX_s3a when it has space for at least two packets in 
the outgoing crosspoint buffer. A few cycles after reset, BufAvail is asserted because all crosspoint buffers are free. 
Afterwards, if a newly arriving packet consumes the next-to-last buffer entry, the DMAI deasserts BufAvail within 
five Sclock cycles of the assertion of dma_fsw_SoPN_s0a. When two or more buffers become free, the DMAI asserts 
BufAvail again. Deassertion of BufAvail is always the result of a packet coming in from the DMA, but assertion of 
BufAvail can happen at any time. 
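
The two-buffer threshold can be sketched as follows. This is an illustrative model only: the class name and the `total_buffers` parameter are hypothetical, and the five-cycle deassertion latency is not modeled.

```python
class BufAvailModel:
    """Hypothetical model of the DMAI's BufAvail decision for one port."""

    def __init__(self, total_buffers):
        self.free = total_buffers   # free entries in the crosspoint buffer

    @property
    def buf_avail(self):
        # Asserted only while at least two packet buffers are free.
        return self.free >= 2

    def packet_arrives(self):
        """A packet from the DMA engine consumes one buffer entry."""
        if self.free == 0:
            raise RuntimeError("packet arrived with no free buffer")
        self.free -= 1

    def buffer_freed(self):
        """A packet leaves the crosspoint buffer, releasing its entry."""
        self.free += 1
```

Keeping one buffer in reserve means that a packet already in flight when BufAvail drops still has somewhere to land.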

The minimum sized packet is four FORDs: two payload FORDs, plus the head and tail FORDs. The maximum 
sized packet is twenty FORDs, of which eighteen form the payload. 

No retries are ever required on this interface. All packets are assumed to arrive in good health. Single bit ECC 
errors will be corrected on the fly before the data enters the XBX. 

SoP and EoP are each asserted for exactly one Sclock cycle. The two are always paired, with exactly one EoP 
assertion for every assertion of SoP. 

The format of the header and trailer FORDs is described in Sections 3.4.1.1 and 3.4.1.2. In the header FORD, 
the DMA engine fills in Vc, NumFords, HasCtrl, and Route. The FSW output block fills in XbeTarget and Lsn, 
and the link fills in the SoP. In the trailer FORD, the DMA engine fills in Type, ProcessIndex, and UnixProcessId, 
and sets Crc32 to zero. The FSW output block fills in Crc32, and the link fills in EoP as the packet goes onto the 
wire. 

Note that each DMA input block is connected to exactly one output block, so there is no mystery about which 
output port the packet will leave on. The route field is NOT shifted in the DMA input block. The two LSBs of 
route represent the output port number in the downstream switch. The two LSBs are used in the crosspoint buffer 
while arbitrating and selecting a downstream XbeTarget. 


Receive Data Path 

Figure 3.3: Data Path from Fabric Switch to DMA Engine Receive Port 0 


fsw_dma_OutDatX_s2a<71:0> Output data+ECC from switch input port X to the receive port buffer X in the 
DMA engine. The DMAO module generates the ECC code on the fly as the packet travels from a crosspoint 
buffer to the DMA engine. 

fsw_dma_DatValX_s2a True if corresponding OutDat is worth looking at. Asserted during every data packet 
cycle, including the SoP and EoP cycles, and at no other time. 

fsw_dma_SoPX_s2a Corresponding OutDat is the first FORD in a packet (you've probably noticed a pattern by now). 

fsw_dma_EoPX_s2a Corresponding OutDat is the last FORD in a packet. 

dma_fsw_RdyX_s1a Port X in the DMA engine is ready for a new packet from switch input port X. 


The DMA engine asserts dma_fsw_RdyX_s1a whenever it has space available in its port X receive buffer. If the 
packet sent from the switch to the DMA engine consumes the last such buffer, the DMA engine must de-assert 
RdyX no later than 3 Sclock cycles after the assertion of fsw_dma_SoPX_s2a. 

The switch asserts DatVal0 and SoP0 and drives the header FORD onto OutDat0, followed by the payload (no 
fewer than 2 and no more than 18 payload FORDs) and the tail FORD. 

Data transfer along this path is assumed perfect. There is no recalculation of CRC, replay logic, length checking, 
etc. The only reason ECC is there is to protect the data later on, in DMA memories and beyond. 

SoP and EoP are each asserted for exactly one Sclock cycle. The two are always paired, with exactly one EoP 
assertion for every assertion of SoP. 

There is a potential race in this interface, in which the FSW consumes the last available buffer in the DMA, 
the DMA deasserts dma_fsw_RdyN, but the FSW doesn’t hear in time to suppress the next packet. To avoid this 
race, the DMA output block will observe the following rule: it will never assert SoP sooner than 6 sclk cycles after 
the previous SoP. This means that very short packets will have a gap after them. Long packets are not affected. 
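
The effect of the spacing rule on packets of different lengths can be worked out directly. This is a sketch that assumes one FORD is transferred per sclk cycle and that the output block would otherwise send packets back to back; the function name is illustrative.

```python
MIN_SOP_SPACING = 6   # minimum sclk cycles between consecutive SoP assertions


def idle_cycles_after(packet_fords):
    """Gap the rule forces after a packet of the given length in FORDs."""
    # A packet of N FORDs occupies N cycles starting at its SoP, so the
    # next SoP has to wait only if N is shorter than the minimum spacing.
    return max(0, MIN_SOP_SPACING - packet_fords)
```

Under these assumptions only 4-FORD and 5-FORD packets are followed by a gap (2 cycles and 1 cycle respectively); anything of 6 FORDs or more is unaffected.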

The format of the header and tail FORDs is described in Sections 3.4.1.1 and 3.4.1.2. 


3.10.1.5 The Fabric Link Receiver (FLR) to Switch Interface 
Receive Data Path 


flrX_fsw_InDat_s0a<63:0> Data arriving through link receiver port X. 

flrX_fsw_DatVal_s0a If true, then InDat is worth looking at. 

flrX_fsw_Idle_s0a If true, then InDat is carrying IDLE FORD information (error control and status). 

flrX_fsw_SoP_s0a If true, then InDat is the first FORD of a packet. 

flrX_fsw_EoP_s0a If true, then InDat is the last FORD of a packet. 

Figure 3.4: Receive Port to Fabric Switch Data Path 
Not shown: flt0_fsw_MissionMode_s0a is asserted throughout. 
(P0...PN are FORDs 0 through N of the payload. 4 < N < 18. See Figure 3.4.1.3. 
IDP is an IDLE packet FORD. See Section 3.4.1.4.) 


The fabric link receivers (FLR0, FLR1, FLR2) send data (flrX_fsw_InDat_s0a<63:0>) and associated control 
information to each of the corresponding input blocks in the fabric switch. Figure 3.6 shows the relative timing between 
the control signals and the data. The first FORD is always marked by the presence of the SoP signal, and the last 
is marked by EoP. All signals are ignored if DatVal is not asserted. 

Note that there must always be exactly one cycle of EoP to follow every SoP. SoP and EoP should never be 
asserted for more than one cycle. 

The switch also receives control and status information via IDLE packets. These are identified by the 
simultaneous assertion of both DatVal and Idle. Section 3.4.1.4 shows the format of this packet. 


Receive Control Packet Path 
fsw_flrX_DatVal_s3a The data on fsw_flrX_CtlDat_s3a is valid. 

fsw_flrX_CtlDat_s3a<7:0> One byte of information to be sent to the upstream node via receiver port X’s control 
link output. 

fsw_flrX_NewCtlPkt_s3a If true, then ignore CtlDat; this is the start of a control packet. (Next cycle’s CtlDat 
will be the first payload byte in the control packet.) 


Each downstream node sends flow control and error information back to the upstream node via the control link 
through the appropriate fabric link receiver. Control Packets are 15 bytes long including the SOP symbol that 
delimits packets. The packet is covered by a 32 bit CRC. See Section 3.4.2.1 for a description of the packet format. 

NewCtlPkt may be asserted for more than one cycle at a time; in this case the start of the next control packet’s 
payload is delayed until the deassertion of NewCtlPkt. 

The important parts of the control packet regulate both error recovery and buffer allocation. 

To review, buffer allocation is performed in the upstream switch. Each time a packet is transmitted by an 
upstream switch, it is assigned a slot in the downstream node’s packet store within the appropriate Crosspoint Buffer 
(XB) along with a packet sequence number (LSN). The upstream node remembers that buffer B was consumed by 
the packet with LSN L in its own LocalBufferBusy mask for the destination output port on the downstream node.[6] 
The upstream node will never assign a packet to a buffer it knows to be in use. The upstream node assumes a 
buffer is in use if it is marked as assigned in the LocalBufferBusy mask, or if it is marked as assigned in the PxBusy 
field of the last received control packet (where ’x’ is the output port number). The upstream node clears packet 
L’s buffer busy bit in the LocalBufferBusy mask when the LSN reported in the last received control packet is equal 
to or greater than L. 
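
The bookkeeping described above might look like the following sketch for a single downstream output port. All names are hypothetical. Because LSNs wrap at 16, "equal to or greater than L" is interpreted here as "L lies in the window between the previous LGSN and the newly reported one"; that interpretation is an assumption, not something the text states.

```python
LSN_MODULUS = 16


class UpstreamBufferTracker:
    """Hypothetical model of one LocalBufferBusy mask in the upstream node."""

    def __init__(self, num_buffers, lgsn):
        self.num_buffers = num_buffers
        self.local_busy = {}   # buffer index -> LSN of the packet sent into it
        self.px_busy = 0       # PxBusy bit mask from the last control packet
        self.lgsn = lgsn       # LGSN from the last control packet

    def pick_buffer(self):
        """Return a buffer believed to be free, or None if all are in use."""
        for b in range(self.num_buffers):
            if b not in self.local_busy and not (self.px_busy >> b) & 1:
                return b
        return None

    def on_send(self, buffer, lsn):
        self.local_busy[buffer] = lsn

    def on_control_packet(self, lgsn, px_busy):
        # Number of packets newly acknowledged by this control packet.
        acked = (lgsn - self.lgsn) % LSN_MODULUS
        for b, lsn in list(self.local_busy.items()):
            # Clear the local bit once the packet's LSN has been acknowledged;
            # from then on PxBusy is the authority for that buffer.
            if 0 < (lsn - self.lgsn) % LSN_MODULUS <= acked:
                del self.local_busy[b]
        self.lgsn = lgsn
        self.px_busy = px_busy
```

The local mask covers the interval during which a packet is in flight and the downstream node cannot yet have reported the buffer as busy.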


[6] Note that the buffer assignment is for a particular crosspoint buffer, so there is a LocalBufferBusy mask for each of the four output 
ports on the downstream node. Similarly, the downstream node reports buffer busy status for each of its four output ports. 

Figure 3.5: Fabric Switch to Receive Port Control Data Path 
(CB0...CB14 are bytes 0 through 14 in the control packet. See Figure 3.1.) 
This picture doesn’t show fsw_flrX_DatVal_s3a and fsw_flrX_MissionMode_s0a, both of which must be asserted 
during all 15 bytes of valid Control Packets. 


Figure 3.6: Fabric Switch to Transmit Port 0 Data Path 
(Timing diagram of Sclk and the fsw_flt0 DatVal, SoP, EoP, and Idle signals. 
P0...PN are FORDs 0 through N of the payload. 4 < N < 18. See Figure 3.4.1.3. 
IDP is an IDLE packet FORD. See Section 3.4.1.4.) 


So, this is worth checking. The upstream node should never send a packet that is destined for a buffer that it 
should believe is busy. If such an event does occur, the packet will be ignored. 


3.10.1.6 The Fabric Link Transmitter (FLT) to Switch Interface 
Transmit Data Packet Path 


fsw_fltX_OutDat_s2a<63:0> Output data from the switch to the downstream node. 

fsw_fltX_DatVal_s2a When true, OutDat is worth looking at. 

fsw_fltX_SoP_s2a When true, OutDat is the first FORD in a transmitted packet. 

fsw_fltX_EoP_s2a When true, OutDat is the last FORD in a transmitted packet. 

fsw_fltX_Idle_s2a When true, OutDat is carrying an IDLE FORD. 


The transmit data packet path is the complement of the receive data packet path. Relative timing and meaning of 
the five signals is identical for both. Figure 3.6 shows the relative timing between the control signals and the data. 
Section 3.4.1.4 shows the format of this packet. 


Transmit Control Packet Path 


fltX_fsw_DatVal_s0a The data on fltX_fsw_CtlDat_s0a is valid. 

fltX_fsw_CtlDat_s0a Another payload byte in a control packet. 

fltX_fsw_NewCtlPkt_s0a When true, ignore CtlDat; the next cycle’s value will be the first byte in a new control 
packet. Also indicates the previous byte was the last in a control packet. 

Figure 3.7: Transmit Port 0 to Fabric Switch Control Data Path 
(CB0...CB14 are bytes 0 through 14 in the control packet. See Figure 3.1.) 
This picture doesn’t show flt0_fsw_DatVal_s3a, which must be asserted during all 15 bytes of valid Control Packets. 


The transmit control packet path is the complement of the receive control packet path. Control packets from the 
downstream node arrive at the fabric link transmit block and are forwarded to the output block (OB) in the switch. 
The OB parses the control packet (See Section 3.4.2.1) to determine the state of buffer allocation in the downstream 
node and to find the latest accepted packet sequence number. The timing and behavior of the two signals in this 
path are described in Figure 3.7. 


3.10.2 Interblock Signals 


Figure 3.8 shows the signals between blocks of the fabric switch. 


3.10.3 The Input Block 


The input block (IB) distributes incoming FORDs from the attached input port to one of the four crosspoint 
buffers (XBs) based on the routing field in the packet’s first FORD. The IB also decrements the VC if necessary 
and performs CRC checking on the packet as its last FORD passes through. It also checks to detect packets that 
have been poisoned. Such packets are removed from the packet store in the XB soon after the last FORD has been 
written. 

The IB remaps the virtual channel field in the header of each incoming packet based on the deadlock avoidance 
routing rules. It also shifts the routing vector two places to the right, by throwing away the two LSBs and shifting 
the input block number (0, 1, or 2) into the two MSBs. It uses the DecrVC register for this IB. DecrVC is a three 
bit vector, written by the SCB interface. If bit X in the vector for IB Y is set, then all packets arriving on port Y 
and destined for port X will have their VC decremented by one. Otherwise the VC field in the packet is unchanged. 
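
The route shift and the VC decrement can be sketched together. Several details here are assumptions: the width of the route field is passed in as a hypothetical `route_bits` parameter because it is not stated in this section, the function names are illustrative, and the destination port is taken to be the two LSBs of the route before shifting (by analogy with the description in 3.10.1.4).

```python
def shift_route(route, input_block, route_bits):
    """Drop the two LSBs and shift the input block number into the two MSBs."""
    if not 0 <= input_block <= 2:
        raise ValueError("input block number must be 0, 1, or 2")
    return (route >> 2) | (input_block << (route_bits - 2))


def remap_vc(vc, decr_vc, dest_port):
    """Apply this IB's DecrVC vector to a packet headed for dest_port."""
    if (decr_vc >> dest_port) & 1:
        if vc == 0:
            # Corresponds to the VcDecrError case in 3.8.4: the packet is dropped.
            raise ValueError("VC decrement below zero")
        return vc - 1
    return vc


def process_header(route, vc, input_block, decr_vc, route_bits):
    """Return (dest_port, new_route, new_vc) for one incoming header."""
    dest_port = route & 0b11          # assumed: the LSBs select the output port
    return (dest_port,
            shift_route(route, input_block, route_bits),
            remap_vc(vc, decr_vc, dest_port))
```

Shifting the arrival port into the top of the route means the field records the path taken so far while it is consumed from the bottom.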

Finally, the IB checks the CRC at the end of the incoming packet and signals any detected error back to the 
input port and forward to the crosspoint buffers. The CRC field is 32 bits wide and is contained in the last FORD 
in the packet. (See Section 3.8.1.) 

The IB is also responsible for passing the “free buffer” vector from the arbitration array and the last good 
sequence number up to the input port. The IB builds control packets and sends them continuously to the upstream 
node. (See 3.4.2.) 


3.10.3.1 Error Detection and Recovery Table 


The following checks are performed on input data packets. Each of these checks becomes a column in the error 
behavior table below. 


• Pro: Was there a protocol error? Check that SoP was always followed by EoP, that EoP was always followed 
by SoP, and that the two were never asserted in the same cycle. If any of these rules was broken, set IbProtocolErr. 


• DV: Was the flrN_fsw_DatVal signal ever low during the data packet? If so, set IbMissingDatavalid. 




Figure 3.8: Interblock Signal Connections (block diagram of the signals running between the Input Block, the Crosspoint Buffer, and the Output Block) 


• BNF: Bad NumFords. Was the NumFords field less than FSW_MINFORDS_PACKET or greater than
FSW_MAXFORDS_PACKET? If so, set IbBadNumFords.


• Lsn: The LSN should always equal the last good LSN plus one (with wraparound at 16). Was the LSN ever
something other than the expected value? If so, set IbMissingLsn.


• LMin: Was the observed packet length less than the NumFords field specified? If so, set IbLengthErrMin.
• LMax: Was the observed packet length greater than the NumFords field specified? If so, set IbLengthErrMax.


• Xbe: Does the XB already hold a packet in the buffer entry that this packet specifies in XbeTarget? If so, set
IbBadXbeTargetErr.


• Vc: Did the VC decrement below zero? If so, set IbVcDecrErr.
• Crc: Was there a CRC mismatch on a data packet? If so, increment R_FswDataCrcCounter (once per packet).

Based on the result of each of these checks, the input block may decide to:


• Drop: Don’t send the packet to the XB or OB, and increment FswPktPoisonCounter. (Dropping an errored
packet is only possible when the error is visible from just the header.)


• Poison: Change the packet type to FSW_POISON_TYPE, and increment FswPktPoisonCounter.
• Replay: Start the replay sequence to ask the upstream node to try again.
• Send: Send the packet normally and increment R_FswPktCounter.
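
Two of the header checks listed above (BNF and Lsn) can be sketched in a few lines. This is only an illustrative software model, not the RTL; the function names are hypothetical, and the constants are taken from the FSW definitions later in this chapter.

```python
# Illustrative model only; function names are assumptions, not spec names.
FSW_MINFORDS_PACKET = 4    # minimum fords per packet (FSW definitions)
FSW_MAXFORDS_PACKET = 19   # maximum fords per packet (FSW definitions)
FSW_LSN_MAX = 16           # LSNs wrap around at 16


def bad_num_fords(num_fords: int) -> bool:
    """BNF check: True when NumFords is outside the legal packet length."""
    return num_fords < FSW_MINFORDS_PACKET or num_fords > FSW_MAXFORDS_PACKET


def expected_lsn(last_good_lsn: int) -> int:
    """The next LSN is the last good LSN plus one, modulo 16."""
    return (last_good_lsn + 1) % FSW_LSN_MAX


def missing_lsn(lsn: int, last_good_lsn: int) -> bool:
    """Lsn check: True when the observed LSN is not the expected one."""
    return lsn != expected_lsn(last_good_lsn)
```

The mapping from failed checks to Drop, Poison, Replay, or Send is deliberately left out, since it is defined by the error behavior table.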


The input block behaviors correspond to the different error checks according to the following table. The columns
on the left are all the types of error checks, and a 1 means that the error was detected. The columns on the right
are the actions that the fabric switch will perform.


3.10.4 The Output Block


The output block performs global arbitration between the four attached crosspoint buffers, maintains a “replay 
buffer,” modifies outgoing packets with updated sub-band and error handling information, and computes the new 
CRC. The output block stores data for all bypass paths, and for the 3-cycle bypass path the OB decides whether 
to allow the bypass or not. 

Global arbitration is described in section 3.7.4. 

The replay buffer holds up to 15 data packets that were recently sent over the outbound link. Every packet 
is recorded in the replay buffer as it is transmitted, and packets are deleted when they are acknowledged by the 
downstream switch. Each packet in the replay buffer is stored at a fixed address according to its link sequence 
number (assigned by the output block). When an error is signaled on the return control link, the output block 
stops sending packets from the crosspoint buffers and sources packets from the replay buffer instead. 
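
The replay buffer bookkeeping described above can be modeled as a small circular store indexed by LSN. The sketch below is a hypothetical software model, assuming 16 LSN values and at most 15 unacknowledged packets; the class and method names are ours, not the design's.

```python
# Hypothetical model of the replay buffer; not the RTL.
LSN_MAX = 16          # number of LSN values
MAX_UNACKED = 15      # the replay buffer holds up to 15 data packets


class ReplayBuffer:
    def __init__(self, initial_lsn: int = 2):
        self.slots = [None] * LSN_MAX      # one fixed slot per LSN
        self.write_lsn = initial_lsn       # LSN for the next outgoing packet
        self.count = 0                     # packets awaiting acknowledgment

    def _pending_lsns(self):
        oldest = (self.write_lsn - self.count) % LSN_MAX
        return [(oldest + i) % LSN_MAX for i in range(self.count)]

    def record(self, packet) -> int:
        """Store a transmitted packet at the address given by its LSN."""
        if self.count == MAX_UNACKED:
            raise RuntimeError("replay buffer full")
        lsn = self.write_lsn
        self.slots[lsn] = packet
        self.write_lsn = (lsn + 1) % LSN_MAX
        self.count += 1
        return lsn

    def acknowledge(self, ack_lsn: int) -> None:
        """Delete every pending packet up to and including ack_lsn."""
        lsns = self._pending_lsns()
        if ack_lsn not in lsns:
            return                         # nothing new is acknowledged
        n_acked = lsns.index(ack_lsn) + 1
        for lsn in lsns[:n_acked]:
            self.slots[lsn] = None
        self.count -= n_acked

    def pending(self):
        """Packets to resend, oldest first, if an error is signaled."""
        return [self.slots[lsn] for lsn in self._pending_lsns()]
```

Because each packet lives at the address equal to its LSN, a replayed packet's LSN can be recovered from its buffer address rather than stored alongside the data.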


3.10.5 The DMA Input Block 
3.10.5.1 Error Detection and Recovery Table 


The following checks are performed on data packets from the DMA. Each of these checks becomes a column in 
the error behavior table below. 


• Ecc2Head: Was there a double-bit error in the header of the data packet? If so, set DmaiDoubleBitErr.


• BNF: Was the NumFords field out of range? If so, set DmaiBadNumFords. (NOTE: Check the NumFords
field after ECC correction.)


• Ecc2Other: Was there a double-bit error in any ford other than the header? If so, set DmaiDoubleBitErr.
• Ecc1: Was there a single-bit error in any of the fords of the data packet? If so, set DmaiSingleBitErr.

Based on the result of each of these checks, the input block may decide to:


• Drop: Don’t send the packet to the XB or OB. (This is only possible when the error is visible from just the
header.)


• Poison: Change the packet type to FSW_POISON_TYPE.


• Send: Send the packet normally and increment R_FswDmaiPktCounter.


The DMA input block behaviors correspond to the different error checks according to the following table. The 
columns on the left are all the types of error checks, and a 1 means that the error was detected. The columns on 
the right are the action that the block will perform. The CSR column gives the name of the flag that is set or the 
counter that is incremented.


| Ecc2Head | BNF | Ecc2Other | Ecc1 | Drop | Poison | Send |


3.10.6 The DMA Output Block
3.10.7 The Crosspoint Buffer 


3.10.7.1 The Arbitration Array 


What does it need to do? 


1. Maintain the “buffer busy” bits: set the bits as packets arrive, and clear them as they leave.

2. Remap virtual channel IDs into downstream buffer requirements.

3. Send flow control information back to the input block.

4. Accept flow control information from the output block.

5. Register incoming packets for arbitration.

6. Fire arbitration when appropriate.

7. Launch packets from the crosspoint buffer into the output block.


The arbitration array selects the next XBE to be sent out of the XB based on which buffers are available downstream. 
First the XBE equal to the VC is considered; if it’s available the next XBE will be equal to the VC number. Then 
all XBE entries whose bit is set in the PoolMask are considered, lowest order bits first. This XBE selection is 
part of the local arbitration stage. The arbitration array also keeps track of occupied entries in the XB and sends 
updates to the appropriate Input Block (IB) which will then pass the information upstream. 
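
The XBE selection rule above (VC first, then PoolMask entries from the lowest-order bit up) can be written out as follows. This is a sketch under assumptions: the function name and the single 16-bit `downstream_busy` mask are ours, and the return value for "nothing free" borrows NO_XBE_AVAILABLE from the FSW definitions.

```python
NO_XBE_AVAILABLE = 16   # value meaning "no crosspoint buffer was available"


def select_next_xbe(vc: int, pool_mask: int, downstream_busy: int) -> int:
    """Pick the downstream XBE for a packet on virtual channel `vc`.

    pool_mask and downstream_busy are 16-bit masks, one bit per XBE.
    """
    # First choice: the entry whose number equals the VC.
    if not (downstream_busy >> vc) & 1:
        return vc
    # Otherwise scan the pool entries, lowest-order bit first.
    for xbe in range(16):
        if (pool_mask >> xbe) & 1 and not (downstream_busy >> xbe) & 1:
            return xbe
    return NO_XBE_AVAILABLE
```

For example, with entries 14 and 15 in the pool, a packet on VC 3 gets entry 3 when it is free, entry 14 when entry 3 is busy, and no entry at all when 3, 14, and 15 are all busy.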

Upon arrival, each packet specifies the destination XBE in its first FORD. The destination is checked against 
the packet store’s valid bits. If there’s already a packet in that entry, the packet is ignored and treated as a 
BadXbeTargetError. Otherwise, the virtual channel specification from the first FORD is decoded into a 16 bit 
vector and ORed with PoolMask and stored for the destination XBE. The destination port on the downstream 
node (that is, the next routing token from the FORD) is written into the EntNextPort_s2a<1:0> register for 
the XBE. At the same time, the entry’s age vector XbeAgeVec_s2a<15:0> is set to all 1’s except for the bit in 
the age vector corresponding to the destination XBE. 
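
As a rough illustration of the arrival bookkeeping just described, the following sketch computes the stored request vector and the initial age vector for one arriving packet. The function name and tuple layout are hypothetical; only the bit manipulations come from the text.

```python
def on_packet_arrival(vc: int, dest_xbe: int, pool_mask: int, valid_bits: int):
    """Return (bad_target, request_vector, age_vector) for an arriving packet."""
    if (valid_bits >> dest_xbe) & 1:
        # Entry already occupied: treated as a BadXbeTargetError.
        return True, 0, 0
    # One-hot decode of the 4-bit VC into 16 bits, ORed with the pool mask.
    request_vector = ((1 << vc) | pool_mask) & 0xFFFF
    # All ones except the bit for the destination entry itself.
    age_vector = 0xFFFF & ~(1 << dest_xbe)
    return False, request_vector, age_vector
```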

The arriving packet will begin to participate in local arbitration in the following cycle if it’s eligible. 

The OB sends each arbitration array updated buffer busy masks for each of the four outbound ports on the 
destination node. 


3.10.7.2 The Packet Store 


The packet store (PS) is a 72 bit by 320 word RAM organized as 16 blocks of 20 words each. Each word consists
of 64 bits of data protected by 8 bits of ECC. Each 20 word block comprises a crosspoint buffer entry (XBE). The
PS gets its input data directly from the input unit (though the ECC is generated in the packet store) and the write
address comes from the arbitration array. Similarly, the arbitration array sends the read address to the PS. The
PS can simultaneously read and write. It runs off of the fabric switch clock (SClock), nominally 200 MHz.

For each XB entry, the packet store also stores a Valid bit, the VC (4 bits) and NextPort (2 bits) in flops, since 
those are needed in order to determine eligibility and make requests in every cycle. 
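
The packet store geometry works out as 16 × 20 = 320 words of 64 + 8 = 72 bits. The sketch below shows one plausible addressing scheme, assuming the 16 blocks are laid out contiguously in XBE order; that layout and the function name are assumptions, not something the text specifies.

```python
WORDS_PER_XBE = 20     # words in one crosspoint buffer entry
NUM_XBE = 16           # entries per packet store
DATA_BITS, ECC_BITS = 64, 8


def packet_store_address(xbe: int, ford_index: int) -> int:
    """Word address of ford `ford_index` (0-based) within entry `xbe`."""
    if not 0 <= xbe < NUM_XBE:
        raise ValueError("xbe out of range")
    if not 0 <= ford_index < WORDS_PER_XBE:
        raise ValueError("ford index out of range")
    return xbe * WORDS_PER_XBE + ford_index
```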


3.11 Pipeline Timing 


The following tables describe the pipeline stage in which different events occur. A brief version is presented 
first, followed by tables with more detailed descriptions. 


3.11.1 Summary


IBX checks for errors, modifies header, sends to XBX. XBX stores it. 
XBX chooses oldest eligible packet and asserts ReqPst to OBX. XBX starts to 
read packet store address 0 for the packet for which it is requesting. OBX does 
global arb, then per-VC-and-NextPort arb to select winning packet. 


OBX asserts GntPst to winning XBX, which starts to read packet store address 
1 of winning packet. 

OBX drives SoP data to Fabric Link Transmitter. All outputs are also stored in 
Replay Buffer. 


3.11.2 Incoming Packet is Stored in Crosspoint Buffer, Arbitrates, and Wins


Each of these tables tells a story of how an input stimulus triggers actions within the block, and how quickly 
each block reacts to the stimulus. The first story follows a packet along a common path through the switch. It does 
not qualify for bypass, so it is stored in a crosspoint buffer until it wins arbitration and gets sent out. Because this 
is a common path that touches almost every subblock of the switch, we have used it to define the cycle numbers 
S0 through S5. Other tables will refer back to these cycle numbers as we describe faster and slower paths through
the switch. 

The first ford of a packet (identified by the SoP pulse) is driven from the Fabric
Link Receiver to the Fabric Switch IBX. The IBX performs minimal signal
cleanup and then flops the data.

The IBX checks the packet for errors (CRC, length, VC, and more). The routing 
string in the header is shifted, the VC is decremented. The IBX forwards packet 
data to four crosspoint buffers in mid-S1, along with ibx_xbx_PortSel wires which 
tell which crosspoint buffer is selected. The IBX also forwards packet data to 
every OBX to support bypass, but bypass will not be covered in this table. 

In the selected XB, compute the new age vectors and set the valid bit for the XB 
entry which will accept the packet. The packet store computes ECC and begins 
to write the packet on the S2 edge. Or, if the XB already has a packet in that
buffer entry, prepare to raise xbx_ibx_BadBufIdx in S2.

In the XB, the new age vector and valid bit cause the packet to participate in
local arbitration. If a free buffer is available, the packet is eligible; if it is the
oldest eligible, it wins local arbitration. If any packet is eligible, the XB asserts
xbx_obx_ReqPst, xbx_obx_NextVC, and xbx_obx_NextPort, and it starts to read
address 0 of the packet. Unless global arbitration is inhibited, the OBX performs
global arb between the four XBs that drive it. It takes the VC and NextPort of
the winner and arbitrates between any of the four candidate packets that have
the same VC and NextPort as the winner. The result of arbitration is flopped in
OBX.

Meanwhile, the set of XBE valid bits are sent from XBX to IBX 
(xbx_ibx_BMask_s2a) so that it can forward them to the upstream switch in 
control packets. 

OBX asserts obx_xbx_GntPst signal to the winning XB. The winning packet is 
whichever packet caused the XB to request in the previous cycle. (Since local 
arb happens constantly based on buffer busy masks, it is conceivable that the
local arb winner in S3 is different from the winner in S2.) The XB starts to read
address 1 of the packet. 

As soon as the grant arrives, the XB entry becomes free. The busy mask sent 
from XB to IBX reflects this in S4. This sounds scary, but it is safe because we
are now committed to reading the XBE one ford per cycle; no incoming packet 
could ever overtake the read. We must be careful with how we represent the 
packet length so that it’s not overwritten if a new packet takes this XBE during 
the read. 

In the XB, the data from address 0 is now available. The XB does ECC correction 
and its output mux selects the packet store output and sends it to the OBX. The 
XB is responsible for filling in the “next XB” entry field in the header as it goes
out to OBX. In OBX, the output mux selects data from the winning XB, inserts 
an LSN, computes CRC, and the data is flopped. 

As the packet header passes through, the length field is captured and used to 
decide how long to inhibit arbitration for the next packet. For a minimum length 
packet of 4 fords, arb can begin again in S6 so that it completes just in time to 
choose the next packet. As the length increases by one, the arb inhibit window
increases by one as well. Arb inhibit is tuned so that a winner is chosen just in 
time to be sent immediately after the previous packet ends. 

The winning packet will consume a buffer in the downstream switch, and the 
OBX must remember that fact in a “pessimistic busy mask” until a control packet
arrives that acknowledges this packet. The pessimistic busy mask is updated so
that in S4, the busy masks sent to the XBs reflect buffer status after accounting
for the winner selected in S2.

Data comes out of the flop and is sent to Fabric Link Transmitter. At the same 
time it is also written into the Replay Buffer, in case the packet needs to be 
resent later. The replay buffer recomputes ECC and stores 72 bits of data. If 
the packet being sent is only 4 fords long, and any XBs are requesting, global 
arbitration can begin in S6. 
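
The arb inhibit timing in this table reduces to simple arithmetic, sketched below. It assumes the stage numbering used above (winner chosen in S2, SoP driven to the link in S5), so the next arbitration must land exactly one packet length after the previous one. The function and constant names are hypothetical.

```python
ARB_TO_LINK_STAGES = 3   # winner chosen in S2, SoP driven to the link in S5


def next_arb_stage(arb_stage: int, num_fords: int) -> int:
    """Stage in which global arb may run again after a winning arbitration.

    The winner occupies the link for num_fords cycles, so the next
    arbitration can start num_fords cycles after the previous one.
    """
    return arb_stage + num_fords
```

With the canonical arbitration in S2, a 4-ford packet allows arbitration again in S6, and each additional ford pushes that out by one cycle. The next winner then reaches the link three stages later, immediately after the previous packet's last ford.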


3.11.3 Packet Must Wait for Available Downstream Buffer


A packet arrives in a crosspoint buffer, but it cannot arbitrate and be sent out the output port because it is 
waiting for a downstream buffer to become free. Let’s assume that we’re not waiting for a pessimistic bit in the 
local busy mask; we’re just waiting for the downstream buffer. 


SoP arrives from Fabric Link Receiver 


IBX checks for errors, modifies header, sends to XBX. XBX stores it. 


Stall 1 | The packet requires a downstream buffer that is not available, so it cannot
participate in local arb. Other packets that need different buffers will continue to
flow.
Stall N-1 | The end of a control packet arrives (from the FLT) which says that the required
buffer has been freed. Control packet data is flopped in the OBX.
Stall N | OBX checks the CRC and decides that the control packet is valid. OBX sends
updated busy mask to each XBX. XBX flops the busy mask.
S2 | Now the packet is eligible and (if it’s the oldest eligible) wins local arb. The XB
asserts ReqPst to OBX. OBX does global arb, then per-VC-and-NextPort arb to
select winning packet. XB begins to read packet at address 0.
3 
XBX delivers packet SoP to OBX. OBX fills LSN and CRC fields and flops data. 
OBX drives data to Fabric Link Transmitter. All outputs are also stored in 
Replay Buffer. 


3.11.4 Packet Loses Global Arb, but Wins on Second Try 
Description 


SoP data arrives from Fabric Link Receiver on IB1 destined for OB2. 
IB1 checks for errors, modifies header, sends to XB12. XB12 stores it. 


Now the packet is eligible and (if it’s the oldest eligible) wins local arb. The XB
asserts ReqPst to OB2. OB2 does global arb, then per-VC-and-NextPort arb to
select winning packet. XB12 begins to read packet at address 0. But OB2 selects
a packet from XB02 that is 8 fords long instead. Global arb is disabled for 8
cycles.

Stall 1 | Packet is still oldest eligible, so XB12 continues to assert ReqPst. But global arb
is disabled so the request is ignored. XB12 reads the packet at address 0 again.
...
Stall 7 | Packet is still oldest eligible, so XB12 continues to assert ReqPst. But global arb
is disabled so the request is ignored. XB12 reads the packet at address 0 again.

Global arb is enabled again. This time global arb selects the packet in XB12.
S3 | OB2 asserts GntPst to XB12, which finally starts to read address 1 of the packet.
XB12 invalidates the crosspoint buffer entry.
S5 | OB2 drives data to Fabric Link Transmitter. All outputs are also stored in
Replay Buffer.


3.11.5 Packet with CRC Error is Poisoned and Sent Anyway 


Because we do cut-through routing, a packet may already be on its way out to the next node before we discover 
an error such as CRC, which is only detectable at the end. All we can do is poison the packet (change the type 
field in the last ford) and try to cancel it if it’s still waiting to be transmitted. In this example the packet is 6 fords 
long: FORD1 through FORD6. 

IBX checks for errors, modifies header, sends to XBX. XBX stores FORD1.
FORD2 arrives from FLR, and input block continues to compute CRC. 

XBX chooses this packet in local arbitration and asserts ReqPst. OBX does 
global arb and selects this packet. 

OBX asserts GntPst to the winning XBX. XBX clears the valid bit for the XB 
entry so that it can be reused. 

OBX drives FORD1 to Fabric Link Transmitter. 

FORD6 arrives from FLR. The input block computes final CRC and it doesn’t 
match the CRC field in FORD6. Now we know there was a CRC error, but the 
XB and OBX have already begun to send the packet. The IB changes the packet 
type to Poison as it sends to XB, asserts ibx_xbx_BadPacket_s1a, and increments
a CRC error count register. The BadPacket signal causes the XB to clear the
valid bit for the XB entry, but it was already cleared by GntPst in S3 so this has
no effect.

XBX sends FORD4 to OBX. OBX drives FORD3 to FLT. 

XBX sends FORD5 to OBX. OBX drives FORD4 to FLT. 

XBX sends FORD6 to OBX. (It already has the Poison type because the IB 
changed it.) OBX drives FORD5 to FLT. 


OBX drives FORD6 to FLT. 


3.11.6 Packet with CRC Error is Dropped 


If the errored packet sits around in the crosspoint buffer long enough, we have time to cancel it before it goes 
out. In this example, we show how that would work. Consider the same 6-ford packet, but this time the XB was 
not able to send it out because of contention for the output port. 


It is important to invalidate errored packets that are consuming crosspoint buffer entries. Before long, replay 
will provide the good version of the packet and try to put it in the same crosspoint buffer entry. If the entry is still 
filled by a bad packet, we would have to keep replaying until the junk packet wins arbitration and gets sent out. 


FORD1 arrives from Fabric Link Receiver. 
S1 IBX checks for errors, modifies header, sends to XBX. XBX stores FORD1.
FORD2 arrives from FLR, and input block continues to compute CRC.
XBX chooses this packet in local arbitration and asserts ReqPst. OBX does 
global arb, and some other packet is selected. The XBX continues to request for 
this packet. 


FORD4 arrives in IBX. The XBX continues to request. 
FORD5 arrives in IBX. The XBX continues to request. 


S5 FORD6 arrives from FLR. The XBX continues to request. 
The input block computes final CRC and it doesn’t match the CRC field in
FORD6. Now we know there was a CRC error! The IB changes the packet type
to Poison as it sends to XB, asserts ibx_xbx_BadPacket_s1a, and increments a
CRC error count register. The BadPacket signal causes the XB to clear the valid
bit for the XB entry.

Because the XB entry for the bad packet is no longer valid, the XB stops
requesting in S6.

There’s still one last way that the packet will be sent out. If in S6, a GntPst
arrives from the OBX, the XB would still have to send the poisoned packet.
(Remember, grants always apply to the request that was made one cycle before.)
Otherwise, the bad packet is dropped.
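
The difference between the two CRC-error stories comes down to when the grant arrives relative to the cycle in which the error is discovered. The following sketch captures that rule in a hypothetical helper; stage numbers are plain integers (5 for S5, and so on), and the function name is ours.

```python
def fate_of_bad_packet(error_stage: int, grant_stage) -> str:
    """Return 'poisoned' or 'dropped' for a packet whose CRC check fails.

    error_stage: stage in which the last ford arrives and the error is seen.
    grant_stage: stage in which GntPst arrives, or None if never granted.
    """
    # A grant answers the request made one cycle earlier, so a grant that
    # arrives up to one cycle after the error still commits the packet.
    if grant_stage is not None and grant_stage <= error_stage + 1:
        return "poisoned"
    return "dropped"
```

For the 6-ford example, a grant in S3 (section 3.11.5) or as late as S6 sends the poisoned packet, while no grant by S6 means the entry is invalidated and the packet is dropped.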


3.11.7 About the Bypass Paths


The canonical path through the fabric switch (above) has 6 cycles of latency, but our goal is 3 cycles of latency. 
When the switch is not busy and all required resources are available, packets can bypass the crosspoint buffer and 
go straight from the input block to the output block. But before we can accept a packet for bypass, we must check 
several things. 


1. Availability of downstream buffers 


2. Eligible packets in any XB contending for the same output port must go first, because they are clearly older 


3. Packets arriving simultaneously in other IBXes destined for the same OBX (only one can bypass) 


4. The output port may be busy 


5. Arbitration in OBX may be disabled because a packet is going out already, or because we are in replay 


Usually packets travel from IBX to XBX to OBX, but there are timing concerns about using this path in the 
minimum latency case. To solve this, in the start-of-packet cycle, the IBX forwards packet data and all necessary
control signals directly to the OBX (in addition to the XBX) as it arrives. In this case, the IBX asserts
ibx_obx_ReqBypS1_s1a to the OBX, and the OBX decides whether to allow the packet to bypass or not. Since the
OBX does all arbitration between XBs and has complete information about which downstream buffers are free, the 
OBX will perform the checks for bypass eligibility as well. 
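
The five checks listed above can be combined into a single predicate, sketched here as a software model. The field names are hypothetical, and the third check is modeled as "this packet lost the tie-break among simultaneous bypass requests", which is our reading of "only one can bypass".

```python
from dataclasses import dataclass


@dataclass
class BypassContext:
    downstream_buffer_free: bool   # check 1
    xb_packet_eligible: bool       # check 2: an older packet waits in an XB
    lost_ibx_tiebreak: bool        # check 3: another IBX's packet was chosen
    output_port_busy: bool         # check 4
    arb_disabled: bool             # check 5: packet in flight, or replay


def allow_bypass(ctx: BypassContext) -> bool:
    """True only when every condition in the list above is satisfied."""
    return (ctx.downstream_buffer_free
            and not ctx.xb_packet_eligible
            and not ctx.lost_ibx_tiebreak
            and not ctx.output_port_busy
            and not ctx.arb_disabled)
```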


Another common case is that a packet arrives while the OBX is busy, but it becomes free one or two cycles 
later. It’s a shame to make these packets wait for the 6-cycle latency path when they could in theory go through 
in 4 or 5 cycles. To accommodate these packets that just missed the window of opportunity, we provide two other
bypass options by delaying the data for one or two cycles in a pipeline in the OBX. 


The four types of requests presented to the OBX are described below. 


Request Signal | Direction | Description
ibx_obx_ReqBypS1_s1a | IBX to OBX | Request to send out the packet that is now being forwarded
from IBX to OBX. If accepted, the latency is 3 cycles.
xbx_obx_ReqBypS2_s2a | XBX to OBX | Request to send out the packet that was forwarded from
IBX to OBX one cycle ago. If accepted, the latency is 4 cycles. This type of request is only
allowed if obx_xbx_PktCanBypass_s1a was asserted in the previous cycle.
xbx_obx_ReqBypS3_s2a | XBX to OBX | Request to send out the packet that was forwarded from
IBX to OBX two cycles ago. If accepted, the latency is 5 cycles. This type of request is only
allowed if obx_xbx_PktCanBypass_s1a was asserted in the previous cycle.
xbx_obx_ReqPst_s2a | XBX to OBX | Request to send out a packet from the XB packet store.
Data will be sent from XBX to OBX one cycle later. If accepted, the latency is 6 cycles (assuming
downstream buffers are available and packet wins all arbitration on first attempt).


The OBX considers all requests and chooses a winner. It drives the following signals to tell the XBX what is 
going on. 

Grant Signal | Direction | Description
obx_xbx_GntPst_s3a | OBX to XBX | Indicates that the request made in the previous cycle was
granted, and the XB should drive the packet data to the OB in the next cycle. The OB can assert
GntPst in response to ReqPst. The XB invalidates the packet’s XBE immediately to prepare for a
new packet.
obx_xbx_GntByp_s3a | OBX to XBX | Indicates that the request made in the previous cycle was
granted, and data was already present in the OB. The OB can assert GntByp in response to
ReqBypS1, ReqBypS2, or ReqBypS3. The XB invalidates the packet’s XBE immediately.
obx_xbx_PktCanBypass_s1a | OBX to XBX | PktCanBypass=1 tells the XB that it is allowed to use
bypass requests ReqBypS2 and ReqBypS3 starting in the following cycle. If PktCanBypass=0, it
can only assert ReqPst requests. It is valid all the time, generated based on the state of the
output port.
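
To make the request and grant rules concrete, here is a small sketch of how an XB might pick which request to raise for its oldest eligible packet. The function name and the `cycles_since_forward` argument are assumptions made for illustration; the latencies and the 5 ns cycle time (200 MHz SClock) come from the text.

```python
def choose_request(cycles_since_forward: int, pkt_can_bypass_prev: bool) -> str:
    """Name of the request an XB raises for its oldest eligible packet.

    cycles_since_forward: cycles since the IBX forwarded the packet's SoP
    to the OBX (1 means the previous cycle).
    pkt_can_bypass_prev: PktCanBypass as seen in the previous cycle.
    """
    if pkt_can_bypass_prev and cycles_since_forward == 1:
        return "ReqBypS2"   # 4-cycle path
    if pkt_can_bypass_prev and cycles_since_forward == 2:
        return "ReqBypS3"   # 5-cycle path
    return "ReqPst"         # 6-cycle path through the packet store


LATENCY_CYCLES = {"ReqBypS1": 3, "ReqBypS2": 4, "ReqBypS3": 5, "ReqPst": 6}
CYCLE_NS = 5                # one SClock cycle at a nominal 200 MHz
```

ReqBypS1 does not appear in the function because it is raised by the IBX, not the XB. Multiplying the cycle counts by 5 ns gives latencies of 15, 20, 25, and 30 ns for the four paths.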


In the following tables, the 3, 4, 5, and 6 cycle paths through the switch will be described. 


3.11.8 3 Cycle Latency Path 


This path shows how a packet would see minimum latency through the switch. If all the required resources are 
available, the packet can go straight from the input block to the output block with a total latency of 3 cycles (15 ns).


Description 


SoP data arrives from Fabric Link Receiver on IB1 destined for OB2. 

IB1 checks for errors, modifies header, sends data to both XB12 and OB2. It
asserts ib1_ob2_ReqBypS1_s1a. Because a downstream buffer is available and
there is no contention, OB2 decides to allow bypass. The OB2 output mux
selects the bypassed data from IB1, fills the LSN and CRC fields and flops the
data. Now we can go straight to S5!

OB2 drives data to Fabric Link Transmitter. All outputs are also stored in 
Replay Buffer. 

OB2 asserts ob2_xb12_GntByp_s2a, to tell XB12 that the incoming packet was 
selected for bypass. The XBX clears the valid bit on the XB entry into which 
the packet is (still) being written. 


3.11.9 4 Cycle Latency Path 


SoP arrives from Fabric Link Receiver on IB1 destined for OB2. 
S1 IBX checks for errors, modifies header, sends data to both XB12 and OB2. It
asserts ib1_ob2_ReqBypS1_s1a. But OB2 is still sending out a packet, so the
bypass request is rejected. OB2 places the data in its bypass delay pipeline.

S2 XB12 chooses the oldest eligible packet. If the incoming packet is selected
and PktCanBypass was asserted in the previous cycle, XB12 asserts
xb12_ob2_ReqBypS2_s2a.


Let’s say that the OB2 output port is no longer busy and the bypass packet is 
selected. OB2 reads from the S2 stage of its bypass pipeline, fills in LSN and
CRC, and sends the packet out immediately. We can skip to S5! 

OBX drives data to Fabric Link Transmitter. All outputs are also stored in 
Replay Buffer. The OBX asserts ob2_xb12_GntByp_s3a to inform XB12 that its 
packet won. The XBX clears the valid bit on the XB entry into which the packet 
is (still) being written. 


3.11.10 5 Cycle Latency Path


S0 SoP arrives from Fabric Link Receiver on IB1 destined for OB2.

S1 IBX checks for errors, modifies header, sends data to both XB12 and OB2. It
asserts ib1_ob2_ReqBypS1_s1a. But OB2 is still sending out a packet, so the
bypass request is rejected. OB2 places the data in its bypass delay pipeline.

S2 XB12 chooses the oldest eligible packet. If the incoming packet is selected
and PktCanBypass was asserted in the previous cycle, XB12 asserts
xb12_ob2_ReqBypS2_s2a.

But the output port is still busy, so nobody wins.

S3 XB12 chooses the oldest eligible packet. If the incoming packet is selected
and PktCanBypass was asserted in the previous cycle, XB12 asserts
xb12_ob2_ReqBypS3_s2a.

Let’s say that the OBX output port is no longer busy and the bypass packet is
selected. OB2 reads from the S3 stage of its bypass pipeline, fills in LSN and
CRC, and sends the packet out immediately. We can skip to S5!

S5 OBX drives data to Fabric Link Transmitter. All outputs are also stored in 
Replay Buffer. The OBX asserts ob2_xb12_GntByp_s3a to inform XB12 that its 
packet won. The XBX clears the valid bit on the XB entry into which the packet 
is (still) being written. 


3.11.11 6 Cycle Latency Path (No Bypass) 


This is the canonical 6-cycle path through the fabric switch again. I include it to contrast it with the bypass 
paths. Here you can see how the bypass logic disables itself. 


SoP arrives from Fabric Link Receiver on IB1 destined for OB2. 


S1 IBX checks for errors, modifies header, sends data to both XB12 and OB2. It
asserts ib1_ob2_ReqBypS1_s1a. But OB2 is still sending out a packet, and there
are 3 fords left to transfer so bypass is not going to help anybody.


The output block always knows how many fords are remaining, and it uses that
value to produce ob2_xbx_PktCanBypass_s1a. In every cycle, this signal tells
crosspoint buffers whether they should use bypass requests or packet store
requests in the next cycle. In this case, bypass is useless so PktCanBypass would
be deasserted.

XB12 chooses the oldest eligible packet. If any packet is eligible, XB12 asserts
a request... but which kind of request? It considers using a bypass request,
but it can’t because PktCanBypass was off in the previous cycle. So, it raises
xb12_ob2_ReqPst_s2a and starts to read the packet. Global arb selects XB12 as
the winner.


OB2 asserts ob2_xb12_GntPst_s3a to the winning XB12, which starts to read
packet store address 1 of winning packet.


XB12 delivers packet SoP to OB2. OB2 fills LSN and CRC fields and flops data.


OB2 drives SoP data to Fabric Link Transmitter. All outputs are also stored in
Replay Buffer.


3.11.12 End of Control Packet Arrives, Packets are Acknowledged 

information propagates into the output block and crosspoint buffer. At first, imagine that the replay buffer 
contains 3 packets: LSN 6, 7, and 8. The replay write LSN is 9, so the next data packet will have LSN 9. The 
replay read LSN is 6. The last acknowledged LSN (AckLSN) is 5. 

S0 The final byte of a control packet arrives at OB0. The data is flopped on the
rising edge of S1.

S1 The CRC is checked, and the control packet is found to be good. Write the
AckLSN, buffer busy masks, out of band data, etc. at rising edge of S2. The
new AckLSN is 7, acknowledging correct receipt of LSNs 6 and 7.

S2 The new downstream busy mask is ORed with the pessimistic busy mask produced
by the replay buffer, and driven from OB0 to its four crosspoint buffers,
which flop the busy mask. In the replay buffer, the read LSN is compared
with AckLSN. All LSNs up to and including AckLSN are acknowledged,
and the busy mask bits for any buffers that the acknowledged packets consumed
are cleared. The replay read LSN is set to 7.

S3 Crosspoint buffers may now make requests based on the busy bits from the new
control packet. The pessimistic busy mask now shows that LSN6’s and LSN7’s
buffers are available, so the busy mask sent to XBs may change.
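
The busy mask handling in this table can be illustrated with a pair of bit operations. This is a sketch only: it assumes a single 16-bit mask for one downstream port, and the buffer numbers used in the example (3, 9, and 12 for LSNs 6, 7, and 8) are made up for illustration.

```python
def busy_mask_for_xbs(downstream_busy: int, pessimistic_busy: int) -> int:
    """Busy mask driven to the crosspoint buffers (16 bits, one per buffer)."""
    return (downstream_busy | pessimistic_busy) & 0xFFFF


def clear_acknowledged(pessimistic_busy: int, acked_buffers) -> int:
    """Clear the pessimistic bits for buffers used by acknowledged packets."""
    for buf in acked_buffers:
        pessimistic_busy &= ~(1 << buf)
    return pessimistic_busy & 0xFFFF


# Hypothetical example: LSN 6, 7, 8 consumed downstream buffers 3, 9, 12.
pessimistic = (1 << 3) | (1 << 9) | (1 << 12)
# AckLSN = 7 acknowledges LSNs 6 and 7, so buffers 3 and 9 are released.
pessimistic = clear_acknowledged(pessimistic, [3, 9])
# The control packet itself still reports buffer 3 as busy.
mask_to_xbs = busy_mask_for_xbs(1 << 3, pessimistic)
```

The OR keeps a buffer marked busy if either the downstream switch reports it busy or an unacknowledged packet (here LSN 8) is known to be heading for it.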


3.11.13 End of Control Packet Arrives with ErrFlag=1, Causing Replay 


At first, traffic is flowing normally from the crosspoint buffer and control packets are acknowledging packets 
without error. This corresponds to Replay State = NORMAL. Then a control packet arrives with ErrFlag=1 and 
starts the replay sequence. 

The final byte of a control packet arrives at OB0. The data is flopped on the
rising edge of S1 in temporary registers.

The CRC is checked, and the control packet is found to be good. Write the
ErrFlag and AckLSN to registers that are visible to the output block.


S2 | In the replay buffer, the read LSN is compared with AckLSN. All LSNs
up to and including AckLSN are acknowledged as usual. But because ErrFlag is
asserted, global arb is inhibited starting in S2. Any requests granted in previous
cycles must be completed, but from S2 to the end of replay, no more grants will
be issued. Let’s assume a crosspoint buffer requested in S1 and won. In S2, the
grant goes back to the crosspoint, and the packet is sent out the output mux
during the next few cycles. Wait for the packet SoP to be sent before setting
ErrAck. This guarantees that an idle packet with ErrAck does not sneak out
before the last normal packet.


WaitToAck 2 | The packet that won global arb in S1 is sent to the FLT. Now set ErrAck=1 and
ReplayState=HANDSHAKE. The ErrAck will be carried downstream in Idle
packets. Increment replay counter CSR.

Handshake | In HANDSHAKE state, global arb is still inhibited. Wait until a new control
packet arrives with ErrFlag=0.


Handshake S0 | The final byte of a control packet arrives at OB0, containing the ErrFlag bit =
0. The data is flopped on the rising edge of S1 in temporary registers.


Handshake S1 | The CRC is checked, and the control packet is found to be good. Write the
ErrFlag and AckLSN to registers that are visible to the output block.


Handshake S2 | Once ErrFlag=0, assuming there are packets in the replay buffer, the replay state
is changed to REPLAY, and the replay loop counters are initialized to start at
the first packet after the AckLSN. Begin to read the first FORD of the first
packet to be replayed. If the replay buffer is empty, set ReplayState=NORMAL
and skip to Done!


S3 | Memory read cycle. Increment loop counters and start next read.
S4 | Data emerges from replay buffer. Do ECC correction, select replay data on the
output mux, insert LSN (from the replay buffer address), compute CRC and flop
it.
S5 | First FORD of replayed data is sent to FLT. Unlike other data packets, don’t
record replay packets in the replay buffer!


Replay 1...N-1 | Continue to increment loop counters, read replay buffer, and send packets back-
Replay N | Replay loop counter reaches the end of the last packet in the replay buffer. Set
replay state=NORMAL.


Done! | Because replay state=NORMAL, global arb is enabled again. A packet store or
bypass request could arb and win in this cycle.


3.12 FSW Registers and Definitions 


3.12.1 Package Attributes 
Package 


chip_fsw_spec 


3.12.2 Definitions 
Defines
FSW
Value | Name | Description
32’d04 | MINFORDS_PACKET | Minimum number of fords in a single packet.
32’d19 | MAXFORDS_PACKET | Maximum number of fords in a single packet.
32’d16 | LSN_MAX | How many LSN values are there? This determines the size of some
memories.
32’d2 | INITIAL_LSN | The LSN of the first data packet after an output block is reset. The
output block initializes its LSN pointers to this value.
32’d1 | INITIAL_LGSN | After reset, the Last Good Sequence Number register in an input
block is set to this value. INITIAL_LGSN + 1 should equal INITIAL_LSN.
32’d64 | VC_NP_RR_TABLE_SIZE | Size of the round robin table for every combination of VC and
NextPort. There are 16 VCs and 4 NextPorts.
4’b1111 | POISON_TYPE | The packet type field in the trailer that is recognized by the switch
is the poison type. The fabric switch spec defines the poison value to
be all ones.
32’d16 | XB_NUM_ENTRIES |
32’d4 | NUM_PORTS |
32’d16 | NO_XBE_AVAILABLE | For functions that return a crosspoint buffer entry number, this value
means that no crosspoint buffer was available.
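
The relationships that the definitions state in words can be checked mechanically. The following Python sketch restates a few of the values above with the FSW_ prefix used elsewhere in this chapter. The helper names are hypothetical, and treating the MINFORDS/MAXFORDS bounds as inclusive is an assumption made for illustration.

```python
# Values copied from the Definitions table (FSW_ prefix as used in this chapter).
FSW_MINFORDS_PACKET = 4
FSW_MAXFORDS_PACKET = 19
FSW_INITIAL_LSN = 2
FSW_INITIAL_LGSN = 1
FSW_VC_NP_RR_TABLE_SIZE = 64


def numfords_is_legal(numfords: int) -> bool:
    """True when a NumFords value lies in the legal range.

    Assumption: both bounds are inclusive.
    """
    return FSW_MINFORDS_PACKET <= numfords <= FSW_MAXFORDS_PACKET


def defines_are_consistent() -> bool:
    """Check the two relationships that the table states in words."""
    lsn_ok = FSW_INITIAL_LGSN + 1 == FSW_INITIAL_LSN
    rr_ok = FSW_VC_NP_RR_TABLE_SIZE == 16 * 4  # 16 VCs times 4 ports
    return lsn_ok and rr_ok
```

Under these assumptions, a NumFords value of 3 (the value that the ObBadNumFords force-error bit injects) is rejected as too small.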


3.12.3 Output Mux Select Choices


Enum 
FswOutSel 


Send packets from the replay buffer 


3.12.4 Replay State Machine 


Enum 
FswReplayState 


2’d0 NORMAL Normal operation. Global arb enabled, each packet written to the 
replay buffer as it is transmitted. Transition to HANDSHAKE if 
ErrFlag asserted. 

2’d1 HANDSHAKE Assert error acknowledge flag in idle packets. Global arb disabled. 
Acknowledge packets up to and including the LSN in control packets. 
Wait for ErrFlag to be deasserted and then transition into REPLAY. 


2’d2 REPLAY Global arb disabled. Resend packets from the replay buffer starting at
the acknowledged LSN + 1. When the SoP of the last replay packet
is sent, return to NORMAL state.
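
The three replay states and their transitions can be summarized as a small next-state function. This is a minimal sketch for readers, not the RTL or SystemC implementation: the function and argument names are hypothetical, and the shortcut from HANDSHAKE directly to NORMAL when the replay buffer is empty is taken from the cycle-by-cycle description of the handshake.

```python
from enum import IntEnum


class FswReplayState(IntEnum):
    NORMAL = 0
    HANDSHAKE = 1
    REPLAY = 2


def next_replay_state(state, err_flag, replay_buffer_empty, last_replay_sop_sent):
    """Return the next replay state for one evaluation step (illustrative only)."""
    if state == FswReplayState.NORMAL:
        # ErrFlag asserted by the downstream node starts the handshake.
        return FswReplayState.HANDSHAKE if err_flag else FswReplayState.NORMAL
    if state == FswReplayState.HANDSHAKE:
        if err_flag:
            return FswReplayState.HANDSHAKE  # keep waiting for ErrFlag=0
        # ErrFlag deasserted: replay only if there is something to resend.
        if replay_buffer_empty:
            return FswReplayState.NORMAL
        return FswReplayState.REPLAY
    # REPLAY: leave once the SoP of the last replay packet has been sent.
    return FswReplayState.NORMAL if last_replay_sop_sent else FswReplayState.REPLAY


def global_arb_enabled(state):
    """Global arb is enabled only in NORMAL."""
    return state == FswReplayState.NORMAL
```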


3.12.5 Fabric Switch Control/Status Registers 


This section defines all the CSRs for the fabric switch. All fabric switch registers are accessible through the
SCB (Serial Configuration Bus). Verification code may also use “direct read” and “direct write” methods to access
any register in zero simulation time.
The registers are organized into the following sections: registers that affect the operation of the FSW,
performance and error counters, and control of interrupts.


3.12.5.1 Block Reset Register 


This register allows each block of the FSW to be reset individually. Each block has an active-high signal which 
causes that block to change everything back to its initial state and ignore input traffic. The individual resets are 
provided so that if one link needs to be reset, only the blocks related to that link would need to be affected. 

All blocks start out in reset after power-on, so the fabric will be idle until software deasserts reset to all the
blocks it needs. Normally, software would enable every block by writing all zeroes. The point of separate reset
bits, however, is to reset blocks separately, so here are a few scenarios that take advantage of this ability. If an
output link is known to be bad, software can keep the OB and the four XBs that drive it in reset. Example: FLT1
is a bad link, so assert reset in bit 4 and bits 12+0*4+1=13, 12+1*4+1=17, 12+2*4+1=21, and 12+3*4+1=25.
Or if software wants to reinitialize the DMA engine, it may want to disable all traffic to and from the DMA; in
that case it would assert reset in the DmaiReset and DmaoReset fields until the DMA is ready to receive packets.
Finally, if software needs to reset an entire ICE9, it should also ask neighboring ICE9s to reset the part of their
fabric switch that faces the device that was reset. In the three upstream ICE9s, the output block should be reset to
clear the replay buffer and any lingering LSN state. (If you want to clear/drain old traffic, reset the four crosspoint
buffers leading to the output block too.) In the three downstream ICE9s, the input block should be reset to clear
the replay and LSN state.
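
The bad-link example can be expressed as a short calculation. The Python sketch below builds the R_FswBlockReset bits for one output link using the bit formula 12+X*4+Y for crosspoint buffer XY and the ObReset field at bits 5:3. The function names are hypothetical, and the reading that Y selects the output block is inferred from the FLT1 example.

```python
def xb_bit(x: int, y: int) -> int:
    """Bit position of crosspoint buffer XY in R_FswBlockReset: 12 + X*4 + Y."""
    if not (0 <= x <= 3 and 0 <= y <= 3) or (x, y) == (3, 3):
        raise ValueError("no such crosspoint buffer")  # XB33 does not exist
    return 12 + x * 4 + y


def bad_output_link_reset_mask(link: int) -> int:
    """Reset bits for output block `link` plus the four XBs that drive it."""
    if link not in (0, 1, 2):
        raise ValueError("link must be 0, 1, or 2")
    mask = 1 << (3 + link)  # ObReset occupies bits 5:3
    for x in range(4):
        mask |= 1 << xb_bit(x, link)
    return mask


def bits_set(mask: int) -> list:
    """List the bit positions that are set, lowest first."""
    return [b for b in range(32) if (mask >> b) & 1]
```

For link 1 this sets bits 4, 13, 17, 21, and 25, matching the FLT1 example.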

NOTE: We will not verify 2^27 combinations of reset signals. We will verify the power-on case, operation with
a few blocks permanently disabled, and recovery after an upstream or downstream switch has been reset.

NOTE: When an input block is held in reset, it sends no control packets. This will cause the connected fabric 
link receiver to lose its heartbeat and go into retraining. Similarly, when an output block is held in reset, it sends 
no idle packets, and the fabric link transmitter will lose its heartbeat and go into retraining. To avoid this, the link 
can be reset as well. 

Register 

R_FswBlockReset 

Attributes 

-noregtest -kernel 

Address 

0xE_7D00_001C 


Bit | Mnemonic | Access | Reset | Type | Definition
31:27 | | | | | Reserved
26:12 | XbReset | | | | One bit per crosspoint buffer. Bit 12+X*4+Y affects
crosspoint buffer XY. (There is no crosspoint buffer
XB33.) While XbReset is asserted, invalidate all crosspoint
entries.
11:9 | DmaoReset | | | | One bit per DMA output block. Bits 11:9 affect
DMAO2,1,0. While DmaoReset is asserted, reset any
state in the Dmao block.
8:6 | DmaiReset | RW | | | One bit per DMA input block. Bits 8:6 affect DMAI2,1,0.
While DmaiReset is asserted, reset any state in the Dmai
block.
5:3 | ObReset | | | | One bit per output block. Bits 5:3 affect OB2,1,0.
Reset output block to initial conditions. While ObReset
is asserted, the output block invalidates all entries in the
replay buffer, sets read and write LSN to 2, sets AckLSN
to 1 (pretending it’s received 1 from control packets), and
immediately cancels any packet that is in the process of
being sent. In reset the OB will not update counters or
error flags, and will not send Idle packets.
2:0 | IbReset | RW | | | One bit per input block. Bits 2:0 affect IB2,1,0.
Reset input block to initial conditions. While IbReset is
asserted, the input block will set its LGSN to 1 and clear
any error state. It will not update counters or error flags
and will not send control packets.


3.12.5.2 Block Enable Register


Register 
R_FswBlockEnable 
Attributes 
-noregtest -kernel 
Address 
0xE_7D00_005C 
Bit | Mnemonic | Access | Reset | Type | Definition
26:12 | XbEnable | | | | One bit per crosspoint buffer. Bit 12+X*4+Y affects
crosspoint buffer XY. (There is no crosspoint buffer
XB33.) While XbEnable is low, ignore new packets and
generate no requests.
11:9 | DmaoEnable | | | | One bit per DMA output block. Bits 11:9 affect
DMAO2,1,0. While DmaoEnable is low, ignore requests
from crosspoint buffers so that no traffic is sent to DMA.
8:6 | DmaiEnable | | | | One bit per DMA input block. Bits 8:6 affect DMAI2,1,0.
While DmaiEnable is low, drop any incoming packets by
disabling PortSel signals to XB and OB.
5:3 | ObEnable | | | | One bit per output block. Bits 5:3 affect OB2,1,0.
While ObEnable is low, the output block ignores all requests
to send new data packets. It will continue to send
Idle packets so that the out-of-band channel works. Any
data transmission or replay that is in progress when Enable
goes low will continue until it completes.
2:0 | IbEnable | | | | One bit per input block. Bits 2:0 affect IB2,1,0.
While IbEnable is low, the input block will drop any incoming
packets by disabling PortSel signals to XB and
OB. It will continue to send control packets so that the
out-of-band channel works.

Bug2014: The FSW contains a logic bug which affects the behavior of IbEnable and DmaiEnable. When
IbEnable[N] is low, the input block is supposed to block any packets from entering the crosspoint buffers, and it
does that correctly. But when an errored packet is detected in input block N and IbEnable[N] is low, input block
N may incorrectly ask the connected crosspoint buffers to erase the packet that was most recently sent there, via
the ibx_xbx_BadPacket signal. If there is a packet in the crosspoint buffer, it will be cancelled if it hasn’t been
selected to go out the OBX yet. The result is that a packet that was sent from IBX to XBX while IbEnable[N] was
high MIGHT be erased from the crosspoint buffer, if it’s still there when IbEnable[N] is set to low. The same goes
for Dmai. The simplest and most likely software workaround is to always set IbEnable=7 and DmaiEnable=7 and
never touch them. Alternatively, before changing IbEnable or DmaiEnable bits from 1 to 0, ensure that all traffic
has flowed out of the connected crosspoint buffers by watching the BusyMask values in captured control packets.
But usually if you’re going to set these bits to 0, you are expecting packets to be dropped anyway, so it may not matter.


3.12.5.3 Input Block Mode Register 
There are three mode registers. R_FswIbMode[X] describes the behavior of input block X.


Register 
R_FswIbMode[2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_0010 - 0xE_7D00_0018
Bit | Mnemonic | Access | Reset | Type | Definition
 | | | | | Reserved
 | PktDecrementVc | RW | 7 | | Packet decrement VC.
When set, configures IBX to decrement the VC field in the header
of a data packet. Bit assignment is for [Link-2,Link-1,Link-0].
The default value is SET.
 | PktCrcEna | RW | | | Packet CRC checking enable.
When set, enables CRC checking on incoming data and
idle packets.


3.12.5.4 Output Block Mode Register 


There are three mode registers. R_FswObMode[X] describes the behavior of output block X.


Register 


R_FswObMode[2:0] 


Attributes 


-kernel 


Address 


—— - 0xE_7D00_0008 
Reserved.


Control packet CRC checking enable.
When set, enables CRC checking on incoming control
packets.


3.12.5.5 PoolMask Register 


There are three PoolMask registers. R_FswPoolMask[X] affects the behavior of the crosspoint buffers that drive
output block X.

The PoolMask register specifies which buffers are dedicated and which are in the common pool. For example,
the value 0xFFC0 indicates that crosspoint buffer entries 0-5 are dedicated to VCs 0-5, and entries 6-15 are pool.
If traffic is sent on any VC which has no dedicated buffer, deadlock may result.


Register 
R_FswPoolMask[2:0] 


Attributes 


-kernel 


Address 


0xE_7D00_0060 - 0xE_7D00_0068
Bit | Mnemonic | Access | Reset | Type | Definition
31:16 | | | | | Reserved
15:0 | PoolMask | | | | Sets the PoolMask vector for an output port. The pool
mask is 16 bits wide. If bit X is clear, then buffer slot X
in each of the affected XBs is dedicated to traffic on VC
X, so VC X can safely be used. If bit X is set, then buffer
slot X in each of the port’s XBs is considered in the shared
pool, and VC X must not be used. See the discussion of
the XB, Section 3.10.7. The only useful settings consist
of ones in the MSBs followed by zeroes in the LSBs, e.g.
0x8000, 0xF000, 0xFF00, 0xFFF0, 0xFFFE. The default
of 0xFF00 is correct for 8 VCs. We expect that all PoolMask
values in all ports in every node will be set to the
same value.
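
The rule "ones in the MSBs followed by zeroes in the LSBs" is easy to compute and to check. The following Python sketch derives a PoolMask value from the number of dedicated VCs and tests whether a given mask has the useful shape. The function names are hypothetical helpers for illustration, not part of any SiCortex software.

```python
def pool_mask_for(num_dedicated_vcs: int) -> int:
    """PoolMask with the low `num_dedicated_vcs` bits clear and the rest set."""
    if not 1 <= num_dedicated_vcs <= 15:
        raise ValueError("expected between 1 and 15 dedicated VCs")
    return (0xFFFF << num_dedicated_vcs) & 0xFFFF


def is_useful_pool_mask(mask: int) -> bool:
    """True if mask is ones in the MSBs followed by zeroes in the LSBs."""
    if not 0 < mask < 0xFFFF:
        return False
    dedicated = ~mask & 0xFFFF  # must look like 2**k - 1
    return dedicated & (dedicated + 1) == 0


def vc_is_safe(mask: int, vc: int) -> bool:
    """VC X may be used only when bit X of the PoolMask is clear."""
    return ((mask >> vc) & 1) == 0
```

With these helpers, 8 dedicated VCs gives the default 0xFF00, and 6 dedicated VCs gives the 0xFFC0 value used in the example above.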


3.12.5.6 Out-of-Band Upstream Register 


There are three upstream registers. R_FswOobUp[X] is used to send and receive data to/from the upstream
fabric switch via Fabric Link Receiver X. For a description of out-of-band communication, see section 3.5.1. Bits
9:0 are used to send data upstream. Bits 25:16 are used to receive data from the upstream switch.

Register

R_FswOobUp[2:0]


Attributes 


-kernel 


Address 
0xE_7D00_0080 - 0xE_7D00_0088 


Bit | Mnemonic | Access | Reset | Type | Definition
31:26 | | | | | Reserved
25 | RecvEmpty | R | X | | Empty flag from the upstream node
24 | RecvTaken | R | X | | Taken flag from the upstream node
23:16 | RecvData | R | X | | 8 bits of data from the upstream node
15:10 | | | | | Reserved
9 | SendEmpty | RW | 1 | | Empty flag to be sent to the upstream node
8 | SendTaken | RW | 0 | | Taken flag to be sent to the upstream node
7:0 | SendData | RW | 0 | | 8 bits of data to be sent upstream
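
As an illustration of the out-of-band register layout, the Python sketch below packs the send half (bits 9:0) and unpacks the receive half (bits 25:16). It assumes the field positions RecvEmpty=25, RecvTaken=24, RecvData=23:16, SendEmpty=9, SendTaken=8, SendData=7:0; the function names are hypothetical. The same layout is assumed to apply to the downstream register.

```python
def encode_oob_send(empty: bool, taken: bool, data: int) -> int:
    """Build the send half (bits 9:0) of an out-of-band register value."""
    if not 0 <= data <= 0xFF:
        raise ValueError("data must fit in 8 bits")
    return (int(bool(empty)) << 9) | (int(bool(taken)) << 8) | data


def decode_oob(reg: int) -> dict:
    """Split a 32-bit out-of-band register value into its fields."""
    return {
        "RecvEmpty": (reg >> 25) & 1,
        "RecvTaken": (reg >> 24) & 1,
        "RecvData": (reg >> 16) & 0xFF,
        "SendEmpty": (reg >> 9) & 1,
        "SendTaken": (reg >> 8) & 1,
        "SendData": reg & 0xFF,
    }
```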


3.12.5.7 Out-of-Band Downstream Register 


There are three downstream registers. R_FswOobDown[X] is used to send and receive data to/from the downstream
fabric switch via Fabric Link Transmitter X. For a description of out-of-band communication, see section 3.5.1.
Bits 9:0 are used to send data downstream. Bits 25:16 are used to receive data from the downstream switch.

Register

R_FswOobDown[2:0]

Attributes 


-kernel 


Address 
0xE_7D00_00A0 - 0xE_7D00_00A8
Bit | Mnemonic | Access | Reset | Type | Definition
31:26 | | | | | Reserved
25 | RecvEmpty | R | X | | Empty flag from the downstream node
24 | RecvTaken | R | X | | Taken flag from the downstream node
23:16 | RecvData | R | X | | 8 bits of data from the downstream node
15:10 | | | | | Reserved
9 | SendEmpty | RW | 1 | | Empty flag to be sent to the downstream node
8 | SendTaken | RW | 0 | | Taken flag to be sent to the downstream node
7:0 | SendData | RW | 0 | | 8 bits of data to be sent downstream


3.12.5.8 Output Block Status Registers 


R_FswObxStatus[X] describes the state of the replay buffer in output block X. 
Register 
R_FswObxStatus 


Attributes 


-kernel 


Address 
0xE_7D00_00D0


Bit | Mnemonic | Access | Reset | Type | Definition
29 | Ob2ReplayEmpty | | | | OB2: Replay buffer is empty.
28 | Ob2ReplayFull | | | | OB2: Replay buffer is full.
27:24 | Ob2AckedLsn | R | | | OB2: The last LSN that has been acknowledged by the
downstream node.
23:20 | Ob2NextLsn | R | | | OB2: LSN that the output block will use next, when
building the next data packet.
19 | Ob1ReplayEmpty | | | | OB1: Replay buffer is empty.
18 | Ob1ReplayFull | | | | OB1: Replay buffer is full.
17:14 | Ob1AckedLsn | | | | OB1: The last LSN that has been acknowledged by the
downstream node.
13:10 | Ob1NextLsn | | | | OB1: LSN that the output block will use next, when
building the next data packet.
9 | Ob0ReplayEmpty | | | | OB0: Replay buffer is empty.
8 | Ob0ReplayFull | | | | OB0: Replay buffer is full.
7:4 | Ob0AckedLsn | | | | OB0: The last LSN that has been acknowledged by the
downstream node.
3:0 | Ob0NextLsn | | | | OB0: LSN that the output block will use next, when
building the next data packet.
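
Because the three output blocks use the same field order, the status word can be decoded with a single helper. The Python sketch below assumes that each output block occupies a 10-bit slice (NextLsn in the low 4 bits, AckedLsn in the next 4, then ReplayFull and ReplayEmpty), which is how the legible bit numbers in the table line up. The function name is hypothetical.

```python
def decode_obx_status(reg: int, ob: int) -> dict:
    """Extract the status fields for output block `ob` from R_FswObxStatus.

    Assumed layout: OB n uses bits [10*n + 9 : 10*n].
    """
    if ob not in (0, 1, 2):
        raise ValueError("ob must be 0, 1, or 2")
    base = 10 * ob
    return {
        "NextLsn": (reg >> base) & 0xF,
        "AckedLsn": (reg >> (base + 4)) & 0xF,
        "ReplayFull": (reg >> (base + 8)) & 1,
        "ReplayEmpty": (reg >> (base + 9)) & 1,
    }
```

For example, an output block that has just left reset (LSN pointers at 2, AckLSN at 1, empty replay buffer) would read back as 0x212 in the OB0 slice under this layout.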


3.12.5.9 Force Error Register 


This register causes the circuit to intentionally produce errors that the fabric switch knows how to detect. This 
will help us to test the error detection logic and error handling software. Any kind of error that we can force 
appears in the register description below. 

Here are the types of errors that we WILL NOT force in hardware, and the reason why we have chosen not to
do it. I will go through the interrupt cause registers in the order that they appear in the text. All OOB interrupts
are triggered by software actions, so they don’t need a force bit. We don’t have special bits that force counters to
wrap, because software can simply set the counter to MAX-1 and then force the event to occur once. VC decrement
errors can be created by software by sending a packet with the DMA engine that has a VC=0 on a route that does
a decrement. LengthErrMax is difficult for hardware to produce without screwing up the logic, so no force bit is
provided. Single bit errors can be forced on all replay buffers using ObFlipMemBits, or on all crosspoint buffers using
XbFlipMemBits; bit flipping in individual memories one at a time is not supported.
NOTE: There is a restriction on forced errors in output blocks. Only one of bits 8:0 (the output block force
error bits) may be set at a time. For a given type of output block error, you can set WhichOb to all 1’s to generate
one error in each output block, but you cannot generate different kinds of errors at once. To guarantee predictable
behavior, after writing ones into WhichIb or WhichOb, do not write the register again until the WhichIb and
WhichOb bits go down.
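
Software that writes R_FswForceErr can guard against the restriction on bits 8:0 before issuing the write. The Python sketch below is one possible check, assuming WhichOb sits at bits 15:13 as listed in the register description; the function names are hypothetical.

```python
OB_ERROR_MASK = 0x1FF   # bits 8:0, the output block force error bits
WHICH_OB_SHIFT = 13     # WhichOb occupies bits 15:13


def force_err_write_is_legal(value: int) -> bool:
    """True if at most one of the output block force error bits is set."""
    ob_errors = value & OB_ERROR_MASK
    return ob_errors & (ob_errors - 1) == 0


def build_ob_force_err(error_bit: int, which_ob_mask: int) -> int:
    """Compose a write that forces one error type in the selected output blocks."""
    if not 0 <= error_bit <= 8:
        raise ValueError("error_bit must be in 0..8")
    if not 0 <= which_ob_mask <= 0b111:
        raise ValueError("which_ob_mask is a 3-bit mask")
    return (which_ob_mask << WHICH_OB_SHIFT) | (1 << error_bit)
```

A caller would additionally poll the register and wait for the WhichIb and WhichOb bits to clear before writing again, as the note above requires.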

Register 

R_FswForceErr 


Attributes 


-kernel 


Address 
0xE_7D00_002C


Bit | Mnemonic | Access | Reset | Type | Definition
31:30 | ObFlipMemBits | | | | These bits are XORed with bits 1 and 0 of every word
of data being written to every replay buffer. This allows
software to force single and double bit ECC errors in the
replay buffers.
29:28 | XbFlipMemBits | | | | These bits are XORed with bits 1 and 0 of every word of
data being written to every crosspoint buffer. This allows
software to generate single and double bit ECC errors in
the crosspoint buffers.
27:24 | | | | | Reserved
23:21 | WhichIb | | | | This field controls which input block will generate the
error described in bits 16-20. Bit 21+X controls input
block X. After the error is forced once in an input block,
the corresponding WhichIb bit will be cleared.
20:17 | | | | | Reserved
16 | IbCorruptCtl | RW | 0 | | Flip bit 0 of byte 15 of exactly one control packet.
15:13 | WhichOb | RWS | | | This field controls which output block will generate the
error described in bits 0-12. Bit 13+X controls output
block X. After the error is forced once in an output block,
the corresponding WhichOb bit will be cleared.
12:9 | | | | | Reserved
8 | ObCorruptIdleCrc | | | | Flip bit 0 of the CRC field in exactly one idle packet.
NOTE: only one of bits 8:0 may be set at a time.
7 | | | | | Flip bit 0 of the CRC field in exactly one data packet.
6 | ObMissingLsn | RW | | | Flip bit 2 of the LSN field in exactly one output packet
after computing the CRC. (In other words, the packet will
have bad CRC.)
5 | ObBadNumFords | RW | | | Force the NumFords field to the value 3 in exactly one
output packet after computing the CRC.
4 | ObBadXbeTargetErr | RW | | | Flip bit 0 of the XbeTarget field in exactly one output
packet after computing the CRC.
3 | ObProtocolErr | RW | 0 | | Leave SoP deasserted for exactly one output packet.
2 | | | | | Reserved
1 | ObMissingDatavalid | RW | | | Deassert datavalid during the second ford of exactly one
packet.
0 | ObLengthErrMin | RW | | | Force EoP during the second ford of exactly one packet.


3.12.5.10 Bypass Enable Register 


This register allows software to enable/disable each type of bypass mode. This setting affects all XBs and OBs.
Register
R_FswBypassEnable


Attributes 


-kernel 


Address 
0xE_7D00_003C


Reserved
Perform error correction and detection in all crosspoint 
buffers. When a packet is read from a crosspoint buffer 
and sent to an output block, the ECC of each FORD 
is checked; this guards against bit errors introduced in 
the crosspoint buffer RAM. (Note: For implementation 
reasons, this ECC logic lives in the output block.) 
Perform error correction and detection in all output 
blocks. Whenever a packet is replayed, the ECC of each 
FORD is checked as it is read from the replay buffer; this 
guards against bit errors introduced in the replay buffer 
RAM. 
Perform error correction and detection in all DMA input 
blocks. When a packet enters the FSW from the DMA, 
the ECC of each FORD is checked; this guards against bit 
errors introduced by the memory system or in the DMA 
TX port register file. 


Enable 5-cycle bypass path 
Enable 4-cycle bypass path 


Enable 3-cycle bypass path 


3.12.5.11 Input Block Data Packet CRC Error Counter 


One per input block. 


Register 
R_FswDataCrcCounter [2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_0020 - 0xE_7D00_0028


31:0 | Count RW Data Packet CRC error counter. 
This counter counts number of data packets with CRC 
errors. When the counter wraps around, a bit in the in- 
terrupt register is set. 


3.12.5.12 Input Block Idle Packet CRC Error Counter 


One per input block. 


Register
R_FswIdleCrcCounter[2:0]
Attributes


-kernel


Address
0xE_7D00_0090 - 0xE_7D00_0098


31:0 | Count RW Idle Packet CRC error counter. 
This counter counts number of CRC errors on idle packets. 
When the counter wraps around, a bit in the interrupt 
register is set. 


3.12.5.13 Input Block Good Packet Counter 


One per input block. 


Register 
R_FswPktCounter[2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_0030 - 0xE_7D00_0038


31:0 | Count RW Packet counter. 
This counter counts number of good (error-free) data 
packets received. When the counter wraps around, a bit 
in the interrupt register is set. 


3.12.5.14 Input Block Poison Counter 


One per input block. 


Register 
R_FswPktPoisonCounter[2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_0040 - 0xE_7D00_0048


31:0 | Count RW Packet poison counter. 
This counter counts number of data packets which were 
poisoned or dropped by IBX (not packets which had the 
poison type as they entered). When the counter wraps 
around, a bit in the interrupt register is set. 


3.12.5.15 Output Block Control Packet Error Counter 


There are three counters in the three output blocks. R_FswObCtlErrCounter[X] counts erroneous control
packets in output block X.
Register
R_FswObCtlErrCounter[2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_0050 - 0xE_7D00_0058 


Output block control packet error counter.

This counter counts the number of times the output block
has detected a control packet with an error (CRC or loss
of DataValid). The error counter increments on last byte
of the control packet, so packets that are too short will
not affect the count. When the counter wraps around, a
bit in the interrupt register is set.


3.12.5.16 Output Block Replay Counter



There are three replay counters. R_FswObReplayCounter[X] counts replay events in output block X.


Register 


R_FswObReplayCounter[2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_0070 - 0xE_7D00_0078 


Downstream replay counter.

This counter counts the number of times the output block
has gone into replay at the request of the downstream
node. When the counter wraps around, a bit in the interrupt
register is set.


3.12.5.17 DMA Input Block Packet Counter 


One per DMAI block. R_FswDmaiPktCounter[X] counts packets sent from DMA input block X to the FSW.


Register 


R_FswDmaiPktCounter[2:0] 


Attributes 


-kernel
Address
0xE_7D00_00B0 - 0xE_7D00_00B8 


31:0 | Count RW Packet counter. 
This counter counts number of packets received from the 
DMA. When the counter wraps around, a bit in the in- 
terrupt register is set. 


3.12.5.18 DMA Output Block Packet Counter 


One per DMAO block. R_FswDmaoPktCounter[X] counts packets sent from the FSW to DMA output block X.
Register 

R_FswDmaoPktCounter[2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_00C0 - 0xE_7D00_00C8


31:0 | Count RW Packet counter. 
This counter counts number of packets sent to the DMA. 
When the counter wraps around, a bit in the interrupt 
register is set. 


3.12.5.19 Upstream Control Packet Capture Registers 


These registers allow software to view the control packets sent upstream. R_FswUpCtlWordX[Y] captures
word X of the control packets sent by input block Y. Capture only occurs when software writes the CaptureEna
bit in R_FswUpCtlWord3[Y].

Register 

R_FswUpCtlWord0[2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_01C0 - 0xE_7D00_01C8


31:0 | Word R x Bytes 3-0 of the latest control packet. Byte 0 is in the 
least significant bits. 


Register 


R_FswUpCtlWord1 [2:0] 


Attributes 


-kernel
Address
0xE_7D00_01D0 - 0xE_7D00_01D8


31:0 | Word R x Bytes 7-4 of the latest control packet. Byte 4 is in the 
least significant bits. 


Register 


R_FswUpCtlWord2[2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_01E0 - 0xE_7D00_01E8


31:0 | Word R x Bytes 11-8 of the latest control packet. Byte 8 is in the 
least significant bits. 


Register 


R_FswUpCtlWord3 [2:0] 


Attributes 


-kernel 


Address 
0xE_7D00_01F0 - 0xE_7D00_01F8


24 CaptureEna | RWS Whenever the CaptureEna bit transitions from 0 to 
1, the next control packet will be captured into 
R_FswUpCtlWord0-3. 


23:0 | Word R x Bytes 14-12 of the latest control packet. Byte 12 is in the 
least significant bits. 
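
The four capture words can be stitched back into the 15 bytes of a control packet. The Python sketch below follows the byte ordering given in the register descriptions (the lowest-numbered byte is in the least significant bits of each word, and Word3 carries bytes 14-12 in bits 23:0 with CaptureEna in bit 24). The function names are hypothetical.

```python
def control_packet_bytes(word0: int, word1: int, word2: int, word3: int) -> bytes:
    """Reassemble bytes 0..14 of a captured control packet."""
    out = bytearray()
    for word in (word0, word1, word2):
        out += (word & 0xFFFFFFFF).to_bytes(4, "little")  # bytes 3-0, 7-4, 11-8
    out += (word3 & 0xFFFFFF).to_bytes(3, "little")       # bytes 14-12
    return bytes(out)


def capture_ena(word3: int) -> int:
    """Return the CaptureEna bit (bit 24) of R_FswUpCtlWord3."""
    return (word3 >> 24) & 1
```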


3.12.5.20 Interrupt Cause Registers 0, 1, 2 


The interrupt cause register contains flags which are set when an event occurs, and cleared by software by 
writing a 1 to that bit. The FswIntCause[X] register reflects events that occur in input block X, output block X, 
and DMA input block X. While normally we would like to split these up so that all the bits come from the same 
block, they are grouped together here to reduce the number of registers that software has to read when an interrupt 
occurs. 

Register 

R_FswIntCause[2:0] 

Attributes 

-kernel 

Address 

0xE_7D00_0100-0xE_7D00_0108
Bit | Mnemonic | Access | Reset | Type | Definition
30 | IbRecvUpTaken | RW1C | | | The RecvTaken flag in the R_FswOobUp[X] register has toggled.
29 | IbRecvUpEmpty | RW1C | | | The RecvEmpty flag in the R_FswOobUp[X] register has toggled.
 | | | | | FswPktPoisonCounter[X] has wrapped around
 | IbPktCountWrap | RW1C | 0 | | FswPktCounter[X] has wrapped around
 | IbIdleCrcCountWrap | RW1C | 0 | | FswIdleCrcCounter[X] has wrapped around
 | IbDataCrcCountWrap | RW1C | | | FswDataCrcCounter[X] has wrapped around
 | IbMissingLsn | | | | Missing LSN error. When set, indicates that
at least once, the LSN of a data packet was
not equal to the next number in the sequence.
 | IbBadNumFords | | | | Bad NumFords field error. When set, indicates
that at least once, the NumFords
field in the data packet header was not
between FSW_MINFORDS_PACKET and
FSW_MAXFORDS_PACKET.
 | IbVcDecrErr | | | | Virtual channel decrement error. When set,
indicates that at least once, the virtual channel
decremented below zero and the packet
was redirected to the DMA.
 | IbBadXbeTargetErr | | | | Bad XbeTarget error. When set, indicates
that at least once, the XbeTarget field indicated
a crosspoint buffer that was already occupied.
 | IbProtocolErr | | | | Data packet protocol error. When set, indicates
that at least once, SOP/EOP pair was
not observed.
 | IbMissingDatavalid | | | | Missing Datavalid during data packet. When
set, indicates that DataValid signal has been
observed missing during valid data packet.
 | IbLengthErrMin | | | | Min packet length error. When set, indicates
that the EoP pulse arrived before the NumFords
field specified.
 | IbLengthErrMax | | | | Max packet length error. When set, indicates
that the EoP pulse did not arrive when the
NumFords field specified. (Maybe it came
later, or maybe not at all.)
 | | | | | Reserved.
 | ObReplayFull | | | | The replay buffer in OBX number X is full.
 | ObRecvDownEmpty | | | | The RecvEmpty flag in the
R_FswOobDown[X] register has toggled.
 | ObRecvDownTaken | RW1C | | | The RecvTaken flag in the
R_FswOobDown[X] register has toggled.
 | ObRepDoubleBitErr | | | | An uncorrectable error has occurred in the replay
buffer in OB[X]. This means two or more
bits were corrupted, and the ECC corrector
could not fix it.
 | ObRepSingleBitErr | | | | A single bit error has occurred in the replay
buffer in OB[X], and has been corrected.
 | ObReplayCountWrap | | | | FswObReplayCounter[X] has wrapped around
 | | | | | FswObCtlErrCounter[X] has wrapped around
 | | | | | Reserved
 | | | | | FswDmaoPktCounter[X] has wrapped around
DmaiBadNumFords RW1 0 DMA input block has detected a NumFords 


3.12.5.21 Interrupt Cause Register 3 - For Crosspoint Buffer ECC Errors


Each of the 15 crosspoint buffers detects single bit ECC errors and double bit ECC errors. Each crosspoint
buffer sends that information to the CSR module, which sets one bit in this register for each type. Bits 0 and 16
correspond to XB00, bits 1 and 17 correspond to XB01, etc.


Register 
R_FswIntCauseXbEccErr 
Attributes 

-kernel 

Address 
ee 


Bit | Mnemonic | Access | Reset | Type | Definition
31 | | | | | Reserved
30:16 | | | | | There are 15 bits corresponding to 15 crosspoint buffers.
If while reading XBmn, two or more bits are corrupted
in a 64-bit word, bit number (16+4*m+n) is set. Such
errors cannot be corrected.
15 | | | | | Reserved
14:0 | | | | | There are 15 bits corresponding to 15 crosspoint buffers.
If a single bit error is found and corrected while reading
XBmn, bit number (4*m+n) is set.
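
The bit numbering can be turned back into crosspoint buffer names when servicing the interrupt. The Python sketch below maps a value read from R_FswIntCauseXbEccErr to the XBmn names with single bit errors (bit 4*m+n) and double bit errors (bit 16+4*m+n). The function name is hypothetical.

```python
def decode_xb_ecc_errors(reg: int):
    """Return (single_bit, double_bit) lists of crosspoint buffer names."""
    single, double = [], []
    for m in range(4):
        for n in range(4):
            if (m, n) == (3, 3):
                continue  # there is no crosspoint buffer XB33
            bit = 4 * m + n
            name = f"XB{m}{n}"
            if (reg >> bit) & 1:
                single.append(name)
            if (reg >> (16 + bit)) & 1:
                double.append(name)
    return single, double
```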


3.12.5.22 Interrupt Mask Registers 


For each interrupt cause register, one interrupt mask register controls which conditions can cause the interrupt
to be asserted. R_FswIntMask[2:0] enables interrupts for bits in R_FswIntCause[2:0]. R_FswIntMask[3] enables
interrupts for bits in R_FswIntCauseXbEccErr. All bits are readable/writable, even though there are some bits for
which there is not (yet) any cause bit.

Register 

R_FswIntMask[3:0] 

Attributes 

-kernel 

Address 


0xE_7D00_0190-0xE_7D00_019C 


31:0 | IntMask RW If the corresponding interrupt cause bit is ever set, assert 
the interrupt. 


3.12.5.23 Master Interrupt Register 


This register summarizes the four interrupt cause registers, above. By reading R_FswIntMaster, software can
decide which interrupt cause registers are worth reading.


Register 
R_FswIntMaster 
Attributes 
-kernel 

Address 
0xE_7D00_004C
Bit | Mnemonic | Access | Reset | Type | Definition
31 | Intr | | | | Intr is the boolean OR of all other bits in this register. It is
driven to the fsw_xxx_Int output port, through the CSW,
and a few cycles later ends up in the Slow Interrupt Status
Register in each L2 segment, R_CacxSlIntStat (section
7.18.9).
3:0 | WhichIntCause | | x | | Each bit of this tells whether there are any unmasked
interrupt cause bits in one of the four
Interrupt Cause Registers. Specifically, WhichIntCause[X]
is asserted when any bit in the expression
(R_FswIntCause[X] & R_FswIntMask[X]) is set. For X=3,
use (R_FswIntCauseXbEccErr & R_FswIntMask[3]).
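
The relationship between the cause, mask, and master registers can be modeled in a few lines. The Python sketch below computes the value that R_FswIntMaster would show for given cause and mask values. It assumes that WhichIntCause (bits 3:0) holds the only bits that feed Intr (bit 31); the function name is hypothetical.

```python
def int_master_value(causes, masks) -> int:
    """Model R_FswIntMaster.

    causes: [IntCause0, IntCause1, IntCause2, IntCauseXbEccErr]
    masks:  [IntMask0, IntMask1, IntMask2, IntMask3]
    """
    which = 0
    for x in range(4):
        if causes[x] & masks[x] & 0xFFFFFFFF:
            which |= 1 << x          # WhichIntCause[X]
    intr = 1 if which else 0         # OR of the other bits in the register
    return (intr << 31) | which
```

Software would read this register first, then read only the cause registers whose WhichIntCause bit is set.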


3.12.5.24 Model Magic Register 


This register only exists in the high level model. It allows verification code to perform special functions such as 
dumping out the state to a log file. 


Register 
R_ModelMagicFsw
Attributes 
-noregtest 
Address 
0xE_7D00_0300 


MagicOp | Write with value 1 to make SystemC dump state to log
file.


3.13 Reset and Initialization


3.14 Internal Data Formats and States 


The data formats for some internal buses are documented here in the spec to help the SystemC and Verilog 
models stay in sync with each other. The only people who would care about these formats are the SystemC and 
Verilog authors. Everyone else can safely ignore this section. 


3.14.1 Encoding of Buses between FswCsr and FswIbx 
3.14.1.1 CsrIbxStat - For csr_ibx_Stat_sa bus 


Class 
CsrIbxStat 


May 14, 2014 165 Rev 51328 


SiCortex Confidential CHAPTER 3. THE DENSE FABRIC SWITCH 


Bit | Mnemonic | Definition
 | | Unused. Drive 0.
d0[41] | OobUpEmpty | Out of band Empty flag to be sent upstream
d0[40] | OobUpTaken | Out of band Taken flag to be sent upstream
d0[39:32] | OobUpChar | Out of band character to be sent upstream
 | | Unused. Drive 0.
d0[8:7] | IbNum | Tells the IBX its block number: 0, 1, or 2. Purpose: These
bits will get shifted into the MSB of route, as the route is
shifted right by two places.
 | IbCorruptCtl | This bit is the IbCorruptCtl bit ANDed with the WhichIb
bit in R_FswForceErr. If set, corrupt the next control
packet and set ForceErrDone.
 | EnableIb | Enable the IB
 | ResetIbLow | Reset the IB. This signal is active low.
d0[1] | PktDecVc | Packet Decrement VC
d0[0] | PktCrcEna | Packet CRC checking enable


3.14.1.2 IbxCsrStat - For csr_ibx_Stat_sa bus 


Class 


IbxCsrStat 


Attributes 


-allowunder 
Bit | Mnemonic | Description
d1[63:0] | Oflr0_fsw_InDat_s0a | A copy of flr0_fsw_InDat_s0a, for OCLA and Performance Counter
d0[63:58] | | Unused. Drive 0.
d0[57] | Oflr0_fsw_NewCtlPkt_s0a | flr0_fsw_NewCtlPkt_s0a
d0[56] | Oflr0_fsw_SoP_s0a | flr0_fsw_SoP_s0a
d0[55] | Oflr0_fsw_EoP_s0a | flr0_fsw_EoP_s0a
d0[54] | Oflr0_fsw_Idle_s0a | flr0_fsw_Idle_s0a
d0[53] | Oflr0_fsw_DatVal_s0a | flr0_fsw_DatVal_s0a
d0[52] | Oflr0_fsw_MissionMode | flr0_fsw_MissionMode
 | OobUpEmpty | Out of band Empty flag received from upstream
 | OobUpTaken | Out of band Taken flag received from upstream
d0[39:32] | OobUpChar | Out of band character received from upstream
d0[31:24] | CtlDat | Control packet data
d0[23] | NewCtlPkt | Pulse during first cycle of control packet
d0[22] | ForceErrDone | In the cycle after CorruptCtl causes a control packet to be corrupted, ForceErrDone is asserted for one cycle to tell the CSR module to clear the WhichIb bit.
d0[21:13] | | Unused. Drive 0.
d0[12] | IdleCrcErr | Received idle packet CRC error
d0[11] | PktMissingLsn | Missing LSN error
d0[10] | PktBadNumFords | Packet header had bad NumFords field
d0[9] | PktVcDecErr | VC decrement error
d0[8] | PktBadXbeTargetErr | Packet header had bad XBE target field
d0[7] | PktProtocolErr |
d0[6] | PktLengthMismatch | Packet size does not match length field in header.
d0[5] | PktMissingDatavalid | Datavalid is missing during data packet.
d0[4] | PktLengthErrMin | Min packet length error
d0[3] | PktForceEop | Max packet length error
d0[2] | PktForcePoison | Force poison bit error.
d0[1] | PktRcvdGood | Received good (error-free) data packet.
d0[0] | PktCrcErr | Received data packet CRC error.


3.14.2 SCB Performance Events 


The following events are trackable by SCB statistical event counting. 


Enum 


FswScbEvent 


Attributes 


-descfunc 


Code | Name | Description
 | FLR1_MISSIONMODE | MissionMode from receive link 1
8'h90 | FLR2_SOP | SoP from receive link 2
8'h91 | FLR2_IDLE | Idle from receive link 2
8'h92 | FLR2_MISSIONMODE | MissionMode from receive link 2
 | OB0_BYP_S1 |
 | OB0_BYP_S2 |
 | OB0_BYP_S3 |
 | FLT1_SOP |
 | | Bypass S2 granted from OB1
 | OB1_BYP_S3 | Bypass S3 granted from OB1
8'hA8 | FLT2_SOP | SoP to transmit link 2
8'hA9 | FLT2_IDLE | Idle to transmit link 2
 | FLT2_MISSIONMODE | MissionMode from transmit link 2
8'hAB | OB2_BYP_S1 | Bypass S1 granted from OB2
8'hAC | OB2_BYP_S2 | Bypass S2 granted from OB2
8'hAD | OB2_BYP_S3 | Bypass S3 granted from OB2
 | DMA_FSW_SOP0 | SoP from DMA port TX0
 | FSW_DMA_BUFAVAIL0 | BufAvail from DMA port TX0
 | | SoP from DMA port TX1
8'hC1 | DMA_FSW_DATVAL2 | DatVal from DMA port TX2
8'hC2 | FSW_DMA_BUFAVAIL2 | BufAvail from DMA port TX2
8'hC8 | FSW_DMA_SOP0 | SoP to DMA Port RX0
8'hC9 | FSW_DMA_DATVAL0 | DatVal to DMA Port RX0
8'hCA | DMA_FSW_RDY0 | Rdy from DMA Port RX0
 | FSW_DMA_SOP1 | SoP to DMA Port RX1
 | | DatVal to DMA Port RX1
 | | Rdy from DMA Port RX1


3.14.3 Encoding of Buses between FswCsr and FswDmai


3.14.3.1 CsrDmaiStat - For csr_dmai_Stat_sa bus 


Class 
CsrDmaiStat 
Attributes 
-allowunder 


Bit | Mnemonic | Description
d1[63:0] | U1 | Unused. Drive 0.
d0[63:4] | U0 | Unused. Drive 0.
d0[3] | EnableEccCorr | Enable single bit error correction and double bit error detection as data is read from the DMA engine
d0[2] | ResetDmaiLow | Reset the DMAI. This signal is active low.
d0[1] | EnableDmai | Enable the DMAI.
d0[0] | | Unused. Drive 0.
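
As an illustration of how a model author might build the d0 word of CsrDmaiStat, here is a minimal Python sketch. The bit positions come from the table above; the function and constant names are hypothetical and are not taken from the SystemC or Verilog sources.

```python
# Bit positions from the CsrDmaiStat table (word d0). Names are illustrative only.
ENABLE_ECC_CORR = 1 << 3   # d0[3] EnableEccCorr
RESET_DMAI_LOW = 1 << 2    # d0[2] ResetDmaiLow (active low)
ENABLE_DMAI = 1 << 1       # d0[1] EnableDmai


def pack_csr_dmai_stat(enable_ecc_corr, in_reset, enable_dmai):
    """Return the 64-bit d0 word; every unused bit is driven to 0."""
    word = 0
    if enable_ecc_corr:
        word |= ENABLE_ECC_CORR
    if not in_reset:
        # The reset signal is active low, so the bit is 1 when NOT in reset.
        word |= RESET_DMAI_LOW
    if enable_dmai:
        word |= ENABLE_DMAI
    return word
```

Holding the block in reset with everything disabled yields an all-zero word, which matches the "Drive 0" convention for the unused fields.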
3.14.3.2 DmaiCsrStat - For dmai_csr_Stat_sa bus


Class 


DmaiCsrStat 


Attributes 


-allowunder 


Mnemonic | Description
Odma_fsw_InDat_s0a | A copy of dma_fsw_InDat_s0a, for OCLA and performance counter
 | Unused. Drive 0.
Odma_fsw_DatVal_s0a | A copy of dma_fsw_DatVal_s0a, for OCLA and performance counter
Odma_fsw_SoP_s0a |
Odma_fsw_EoP_s0a |
Ofsw_dma_BufAvail_s1a |
BadNumFords |
IncrPktCount |
DoubleBitErr | ECC corrector in DMA input block has detected a double bit error
SingleBitErr | ECC corrector in DMA input block has detected a single bit error


3.14.4 Encoding of Buses between FswCsr and FswObx 


3.14.4.1 CsrObxStat - For csr_obx_Stat_sa bus 


Class 


CsrObxStat 
Bit | Mnemonic | Description
 | | Unused. Drive 0.
 | ObCorruptIdleCrc | Copy of ObCorruptIdleCrc from R_FswForceErr
 | ObCorruptPktCrc | Copy of ObCorruptPktCrc from R_FswForceErr
 | ObMissingLsn | Copy of ObMissingLsn from R_FswForceErr
 | ObBadNumFords | Copy of ObBadNumFords from R_FswForceErr
 | ObBadXbeTargetErr | Copy of ObBadXbeTargetErr from R_FswForceErr
 | ObProtocolErr | Copy of ObProtocolErr from R_FswForceErr
 | | Unused. Drive 0.
 | ObMissingDatavalid | Copy of ObMissingDatavalid from R_FswForceErr
 | ObLengthErrMin | Copy of ObLengthErrMin from R_FswForceErr
 | | Unused. Drive 0.
d0[56:53] | EnableBypS3 | Enable 5-cycle bypass path. This is a 4-bit vector. Bit 53+x enables bypass from IBx for x=0,1,2, or bypass from the connected DMAI for x=3.
d0[52:49] | EnableBypS2 | Enable 4-cycle bypass path. This is a 4-bit vector. Bit 49+x enables bypass from IBx for x=0,1,2, or bypass from the connected DMAI for x=3.
d0[48:45] | EnableBypS1 | Enable 3-cycle bypass path. This is a 4-bit vector. Bit 45+x enables bypass from IBx for x=0,1,2, or bypass from the connected DMAI for x=3.
 | OobWrite | Ask OB to force a gap between data packets so that an Idle packet will be sent carrying the new Oob values. It stays on until OobWriteAck is sent by the OB. This ensures that the Oob channel is never completely starved.
 | EnableEccCorrXbData | Enable single bit error correction and double bit error detection on data as it is read from the crosspoint buffer
 | EnableEccCorrReplay | Enable single bit error correction and double bit error detection on data as it is read from the replay buffer
 | OobDownEmpty | Out of band Empty flag to be sent downstream
 | OobDownTaken | Out of band Taken flag to be sent downstream
 | OobDownChar | Out of band character to be sent downstream
 | EnableOb | Enable the OB.
 | | Unused. Drive 0.
 | | Unused. Drive 0.
 | CtrlCrcEna | Enable CRC checking on control packets
 | DriveBadBits | Invert bits 1 and 0 of data written to replay buffer, to force ECC errors
 | ResetObLow | Reset the OB. This signal is active low.
d0[15:0] | PoolMask | Pool Mask.
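
The three bypass-enable vectors in CsrObxStat follow the same "base bit + x" rule, which the following Python sketch makes explicit. The base bits (45, 49, 53) are from the field descriptions above; the helper name and its interface are assumptions made for illustration.

```python
# Lowest bit of each 4-bit bypass-enable vector in d0, per the field descriptions.
BYP_BASE = {"S1": 45, "S2": 49, "S3": 53}


def bypass_enable_mask(stage, sources):
    """Return the d0 mask enabling bypass at `stage` for the given sources.

    sources: iterable of x values; x = 0, 1, 2 selects IBx and x = 3 selects
    the connected DMAI.
    """
    mask = 0
    for x in sources:
        if not 0 <= x <= 3:
            raise ValueError("source index must be 0..3")
        mask |= 1 << (BYP_BASE[stage] + x)
    return mask
```

For example, enabling the 3-cycle path from IB0 and from the DMAI sets bits 45 and 48, the two ends of d0[48:45].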


3.14.4.2 ObxCsrStat - For obx_csr_Stat_sa bus 


Class 
ObxCsrStat 
Attributes 


-allowunder 
Bit | Mnemonic | Description
d1[63:0] | Ofsw_flt0_OutDat_s2a | A copy of fsw_flt0_OutDat_s2a, for OCLA and performance counter
d0[58:56] | BypassPerfCount | Bit 56 is high when bypass S1 is granted. Bit 57 is high when bypass S2 is granted. Bit 58 is high when bypass S3 is granted.
 | Oflt0_fsw_CtlDat_s0a |
 | Oflt0_fsw_NewCtlPkt_s0a |
 | Ofsw_flt0_SoP_s2a |
 | Ofsw_flt0_EoP_s2a |
 | AckedLsn | The last LSN that has been acknowledged by the downstream ...
 | NextLsn | LSN that the output block will use next, when building ...
d0[15:12] | XbDoubleBitErr | In the ObxCsrStat bus going to output block N, bit 12+M is set if a double bit error is detected in data coming from crosspoint buffer MN.
d0[11:8] | XbSingleBitErr | In the ObxCsrStat bus going to output block N, bit 8+M is set if a single bit error is detected in data coming from crosspoint buffer MN.
 | ReplayFull |
 | ForceErrDone | In the cycle after one of the FswForceErr bits that affect the output block causes a data packet to be corrupted, ForceErrDone is asserted for one cycle to tell the CSR module to clear the WhichOb bit.
 | OobWriteAck | Acknowledges the OobWrite signal in CsrObxStat. Asserted for one cycle when the OobWrite takes effect.
 | IncrCtlErrCount | Error in a control packet. The OB asserts this signal for one cycle when a control packet error is detected. If DataValid is missing, assert once in the following cycle. If a CRC mismatch is detected, assert once in the following cycle. Even if multiple errors are detected, only assert one time per control packet.
 | IncrReplayCount | OB asserts this signal to increment its ObReplayCounter. It is asserted during the cycle in which m_FltErrFlag_s2a=1 and m_FltErrFlag_s3a=0.
 | DoubleBitErr | The replay buffer has detected a double bit ECC error.
 | SingleBitErr | The replay buffer has detected a single bit ECC error.

Bit fields of d0 listed in the source, in order (their pairing with the mnemonics is recoverable only where shown above): 63:59, 58:56, 55:48, 47, 46, 45, 44, 43, 42, 41, 40, 39:32, 31:24, 23:20, 19:16, 15:12, 11:8, 7, 6, 5, 4, 3, 2, 1, 0.
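
The fields whose positions are stated in the descriptions (BypassPerfCount at bits 58:56, XbDoubleBitErr at bit 12+M, XbSingleBitErr at bit 8+M) can be decoded as in the sketch below. It is a hedged illustration only: the function name and the returned dictionary layout are hypothetical, and M is assumed to range over 0..3 because each error field is four bits wide.

```python
def decode_obx_csr_stat(d0):
    """Decode the ObxCsrStat d0 fields whose bit positions the text states."""
    byp = (d0 >> 56) & 0x7  # BypassPerfCount, d0[58:56]
    return {
        "bypass_s1": bool(byp & 0x1),  # bit 56
        "bypass_s2": bool(byp & 0x2),  # bit 57
        "bypass_s3": bool(byp & 0x4),  # bit 58
        # Crosspoint buffers M that reported an ECC error this cycle.
        "xb_double_bit_err": [m for m in range(4) if (d0 >> (12 + m)) & 1],
        "xb_single_bit_err": [m for m in range(4) if (d0 >> (8 + m)) & 1],
    }
```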


3.14.5 Encoding of Buses between FswCsr and FswDmao 


3.14.5.1 CsrDmaoStat - For csr_dmao_Stat_sa bus 


Class 


CsrDmaoStat 
Bit | Mnemonic | Description
 | U1 | Unused. Drive 0.
d0[63:10] | U0 | Unused. Drive 0.
d0[9:6] | EnableBypS1 | Enable 3-cycle bypass path. This is a 4-bit vector. Bit 6+x enables bypass from IBx for x=0,1,2, or bypass from the connected DMAI for x=3.
d0[5] | EnableEccCorrXbData | Enable single bit error correction and double bit error detection on data as it is read from the crosspoint buffer
d0[4] | EnableDmao | Enable the DMAO.
d0[3] | EnableBypS3 | Enable 5-cycle bypass path
d0[2] | EnableBypS2 | Enable 4-cycle bypass path
d0[1] | | Unused. Drive 0.
d0[0] | ResetDmaoLow | Reset the DMAO. This signal is active low.


3.14.5.2 DmaoCsrStat - For dmao_csr_Stat_sa bus 


Class 
DmaoCsrStat 
Attributes 
-allowunder 


Bit | Mnemonic | Description
d1[63:0] | Ofsw_dma_OutDat_s2a | A copy of fsw_dma_OutDat_s2a, for OCLA and performance counter
 | | Unused. Drive 0.
 | Ofsw_dma_DatVal_s2a | fsw_dma_DatVal_s2a
 | Ofsw_dma_SoP_s2a | fsw_dma_SoP_s2a
 | Ofsw_dma_EoP_s2a | fsw_dma_EoP_s2a
 | Odma_fsw_Rdy_s1a | dma_fsw_Rdy_s1a
d0[2] | XbDoubleBitErr | Double bit error is detected in data coming from the attached crosspoint buffer
d0[1] | XbSingleBitErr | Single bit error is detected in data coming from the attached crosspoint buffer
d0[0] | IncrPktCount | Increment DMA output block packet counter


3.14.6 Encoding of Buses between FswCsr and FswXbx 
3.14.6.1 CsrXbxStat - For csr_xbx_Stat_sa bus 


Class 

CsrXbxStat 

Bit | Mnemonic | Description
d1[63:0] | U1 | Unused. Drive 0.
d0[63:18] | U0 | Unused. Drive 0.
d0[17:16] | DriveBadBits | Invert bits 1 and 0 of data written to crosspoint buffer, to force ECC errors
d0[15:5] | | Unused. Drive 0.
d0[4] | EnableXb | Enable the XB.
d0[3] | EnableBypS3 | Enable 5-cycle bypass path.
d0[2] | EnableBypS2 | Enable 4-cycle bypass path.
d0[1] | EnableBypS1 | Enable 3-cycle bypass path.
d0[0] | ResetXbLow | Reset the XB. This signal is active low.


3.14.6.2 XbxCsrStat - For xbx_csr_Stat_sa bus 


Class 
XbxCsrStat 
Bit | Mnemonic | Description
d1[63:0] | U1 | Unused. Drive 0.
d0[63:0] | U0 | Unused. Drive 0.


3.14.7 Open issues 
Chapter 4


DMA Engine Microcode 


by Jud Leonard 
[Last modified $Id: dmauc.lyx 43841 2007-08-28 19:09:39Z leonard $].


4.0.8 Package Attributes 
Package 


chip_dmauc_spec 


Attributes 


-dwaccessors 


4.1 Introduction 


The DMA Engine provides a high-bandwidth interface between the memory system and the fabric switch, 
relieving software of the low-level work of repetitively creating packets of memory data and injecting them into the 
fabric, or accepting packets from the fabric and distributing their payload to appropriate locations in memory. 

This chapter describes the functions and interfaces of the DMA Engine which are implemented in microcode, and 
are therefore more or less subject to modification in future revisions of that microcode. The underlying hardware 
mechanisms are described in the DMAEngine spec. 

The DMA Engine is designed to work closely with both privileged kernel-level device drivers and user-level 
library software to provide very low overhead transfers in a protected virtual memory environment. Low overhead 
requires that typical transfers can be initiated and completed without invoking kernel-mode or interrupt-level 
software at either sender or receiver, and that buffers need not be copied. 

The DMA Engine provides two levels of communication between cooperating processes within the system: 


• At the first level, user-mode software creates a small information packet on a command queue in its local
memory. The DMA engine pulls the packet off the queue and injects it into the switch fabric with addressing
to deliver it to the desired destination process and error checks to confirm error-free transmission. At the
destination, the DMA engine stores the packet on a user-accessible event queue for processing by software.


• At the second level, rather than generating and processing packets directly, software sets up sufficient state in
the DMA engines at both ends of a transmission to permit the hardware to generate packets at the transmitter
and interpret them appropriately at the receiver. In this case, the DMA engines at both ends are responsible
for managing memory addressing, including generation and verification of physical addresses, for fragmenting
messages into packets, and for reassembly, relieving software of packet-level activity.


For more information about the MPI (Message Passing Interface) standard, visit http://www.mpi-forum.org. 
4.2 Goals


We’ve tried to make the DMA engine as simple as practical, while achieving the following functions:


• It should be able to process the packets of outgoing and incoming messages without intervention by software.

• It should be able to keep a modest number of input and output messages in progress concurrently.

• It should dispose of incoming packets it cannot handle by presenting them to software with minimum overhead
(< 100 ns).

• It should be able to pass outgoing packets from software to the fabric with minimum overhead (< 100 ns).

• It should be able to process several packets concurrently, overlapping multiple memory references.

• It should support local memory-to-memory transfers between address spaces on a single node. It should also
provide a fast memory zeroing function.

• For large contiguous messages from one node to any other on an otherwise idle network, it should achieve 2
GB/sec.

• It must protect the integrity of user and kernel processes from unrelated naive, buggy, or malicious user
processes running on the same system. It is not obliged to protect a user process from kernel-mode software
on any node, nor from other processes with which it is communicating. It need not prevent covert channels
or denial of service attacks.


4.3 Differences, Bugs, and Enhancements


4.3.1 Product and Chip Pass Differences 


1. NEED IMPL: TWC9A records the address and syndrome of DRAM ECC errors, bug2157.


2. NEED IMPL: TWC9A fixes generation of bad ECC when ECC correction is disabled and a 32-bit aligned
packet is read, bug2396. R_SdmaEccMode bit 6 (CifCorrEna) enables ECC correction in CIF. This logic is
only needed when the microengine does a BRD from a memory address with bit 2 set (32-bit realignment).
When CifCorrEna is off and the microengine does a BRD from a memory address with bit 2 set, the ECC
written into the DMA’s internal memory (TX or COPY port packet buffer) is incorrectly forced to zero. Data
with corrupted ECC may reach the FSW or main memory when the packet is sent. To work around this, leave
CifCorrEna always set.


3. NEED IMPL: TWC9A fixes non-correction of ECC during 32-bit realignment operations, bug2403. When
the CifCorrEna bit is on, and DMA is doing a read with 32-bit realignment, and there is a single bit error
on the data from the CSW, the RTL does not correct the error. The RTL corrects the error inside the
DmaCifDatacalg modules, but then incorrectly puts out the uncorrected data on cif_xxx_Data*[63:0] and into
the next DmaCifDatacalg module. But the ECC bits on cif_xxx_data*[71:64] are the ECC consistent with
the corrected data, so the resulting data appears to have just a single bit error. Workaround: None needed,
as the error will be corrected at the destination of the DMA engine.


4. MIGHTFIX: TWC9A might double the size of the instruction memory, bug3390.


5. MIGHTFIX: TWC9A might fix a performance issue which requires a dead cycle between DMA packets headed
into the FSW, bug597.


6. MIGHTFIX: TWC9A might fix DmaCif RDIO being corrupted by subsequent WTIO from the same core,
bug1991. This can cause RDIOs to return corrupted data when followed immediately by a WTIO from the
same CPU. I/O accesses from different CPUs are not affected, and SPCLs are not affected. When it happens,
the WTIO overwrites the data before it can be sent back to the core, so the RDIO incorrectly returns the
data from the WTIO. To avoid this, either issue a SYNC instruction between the RDIO and WTIO, or
be sure to use the RDIO result before issuing the WTIO. All DMA addresses are affected (RA_DmaImem,
RA_DmaDmem, RA_DmaAppIface0,1, etc.) except for those in the SCB range (RA_SDma*). The bug has
only been observed when DMA is in the process of doing lots of block writes and the CSW is heavily loaded.


7. MIGHTFIX: Various possible microinstruction enhancements, bug3392, bug3393, bug3394, bug3395, bug3396.
4.3.2 Known Bugs and Possible Enhancements
4.4 Model 


4.4.1 Terminology 
4.4.1.1 DMA Context (formerly Process) 


The DMA Engine is interacting at any time with the six processors on the same node, and each of those 
processors has activities running in user and kernel mode. For this discussion, we’ll refer to each of those activities 
as a DMA context. The DMA engine keeps separate state and control information for each of 14 contexts, so as 
to minimize the extent to which those activities must use mutual exclusion to coordinate activities. There may be 
multiple Unix threads on one or more processors sharing access to a single DMA context. In this case, the software 
must manage concurrent access to the hardware. 

The DMA engine uses a 4-bit context number (called process index for historical reasons) to uniquely identify 
the block of DMA engine state associated with a particular Linux activity. That state includes a 16-bit process ID, 
which can be used by software to uniquely identify the Linux activity which manages the DMA context. Whenever 
it receives a packet, the DMA engine uses the process index to select a block of process state, and compares the 
process ID in the packet to that in the selected state. A mismatch causes the packet to be treated as an unexpected 
packet, and a PID Mismatch event is stored on the event queue for DMA context number 0. 
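
The receive-side check described above can be sketched in a few lines of Python. This is a simplified software model, not the microcode: the class, the event tuples, and the choice to place every accepted packet on the event queue are assumptions made for illustration (real packets are steered by packet type).

```python
from dataclasses import dataclass, field

NUM_CONTEXTS = 14  # the engine keeps state for 14 contexts


@dataclass
class Context:
    process_id: int                                   # 16-bit process ID
    event_queue: list = field(default_factory=list)   # simplified event queue


def deliver(contexts, process_index, process_id, payload):
    """Select state by process index, then compare the packet's process ID."""
    ctx = contexts[process_index]
    if ctx.process_id != process_id:
        # Mismatch: treat as unexpected and record the event on context 0.
        contexts[0].event_queue.append(("PID_MISMATCH", process_index, process_id))
        return False
    ctx.event_queue.append(("PACKET", payload))
    return True
```

The process index only selects a block of state; the 16-bit process ID comparison is what actually rejects packets addressed to a context that has since been reassigned.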


4.4.1.2 Thread 


The execution model for the DMA Engine is a multithreaded state machine with a thread associated with each 
input or output port. Each thread is activated to process a packet as the necessary resources become available: 
transmit threads wait for an empty transmit buffer, receive threads wait for a full receive buffer. Each port has 
four packet buffers, which spend approximately equal times (~100 ns) in memory references, processing by a thread, 
and moving into or out of the fabric. Queues support communication between transmit and receive contexts, on 
the one hand, and software on the other. 

There are three threads associated with the three input ports, three more with the three output ports, two with 
the copy function (separately for memory read and write), one for queue management, and a specialized thread to 
serve I/O register accesses; total 10. 


4.4.1.3 Handle 


The DMA Engine is accessible to both kernel- and user-mode processes, and it accesses buffers in the virtual 
memory address space of whatever process it is serving. To keep this safe, applications describe accessible memory 
in terms of handles. A handle is an offset into a table of physical memory addresses (called the Buffer Descriptor 
Table, BDT, or the Route Descriptor Table, RDT) approved for use by each process. The tables are writable only 
by the kernel, and the BDT may contain contiguous groups of entries describing virtually-contiguous regions of 
memory. Handles are used to identify buffer regions, commands, and routes. 
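
To make the handle idea concrete, the sketch below models a Buffer Descriptor Table as a kernel-owned list that user code refers to only by offset. The entry layout (base address plus length), the method names, and the use of an exception to stand in for a BDT fault are all assumptions; the actual table format is not described in this section.

```python
class BufferDescriptorTable:
    """Kernel-owned table; user code only ever holds integer handles."""

    def __init__(self):
        self._entries = []  # hypothetical entry layout: (physical_base, length)

    def kernel_add(self, physical_base, length):
        """Kernel-only: approve a region and return its handle (table offset)."""
        self._entries.append((physical_base, length))
        return len(self._entries) - 1

    def translate(self, handle, offset, nbytes):
        """Turn (handle, offset) into a physical address, or fault."""
        if not 0 <= handle < len(self._entries):
            raise LookupError("BDT fault: bad handle")
        base, length = self._entries[handle]
        if offset < 0 or offset + nbytes > length:
            raise LookupError("BDT fault: access outside approved region")
        return base + offset
```

Because a user process can name memory only through a handle, it cannot direct the engine at a physical address the kernel has not approved.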


4.4.1.4 Packet 


The data transport and switching machinery works on units of data called packets, which are individually 
addressed, carry separate error detection codes, and include up to 128 bytes of user payload. With overhead, 
packets may be as large as 152 bytes. Section 4.9 describes the various packet types supported. 

Packets can be categorized into three major classes: 


DMA Packets carry up to 128 bytes of message data between application-space buffers. 


Command Packets carry instructions to be enqueued and processed by the receiving DMA engine; such com- 
mands are treated as if they had been issued by the receiving process at the destination node. 


Interprocess Packets carry up to 128 bytes of data entirely determined by software, to be stored on the event 
queue of the receiving process. 
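
The packet size limits above imply a simple fragmentation rule, sketched here in Python. The 128-byte payload and 152-byte maximum packet size are from the text; treating every packet as carrying the full 24 bytes of overhead is a worst-case assumption, and the function names are illustrative.

```python
MAX_PAYLOAD = 128  # bytes of user payload per packet
MAX_PACKET = 152   # largest packet, including overhead


def packetize(message):
    """Split a message into payload chunks of at most MAX_PAYLOAD bytes."""
    return [message[i:i + MAX_PAYLOAD] for i in range(0, len(message), MAX_PAYLOAD)]


def worst_case_wire_bytes(message_len):
    """Upper bound on bytes sent, assuming maximum overhead on every packet."""
    overhead = MAX_PACKET - MAX_PAYLOAD        # at most 24 bytes per packet
    packets = -(-message_len // MAX_PAYLOAD)   # ceiling division
    return message_len + packets * overhead
```

Under that assumption a full packet is at least 128/152, or roughly 84 percent, payload.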
4.4.1.5 Command


A command is an instruction to the DMA engine, either issued by a local processor or received encapsulated in a
packet from a remote processor. Commands are stored on queues in memory while waiting to be performed by the
DMA engine.


4.4.1.6 Segment 


Messages may be very long — conceivably longer than the physical memory available to a single process. There- 
fore, we recognize that the message passing library software may want to break a single message up into a number of 
segments for independent transmission. The DMA engine hardware is optimized for the case that both source and 
destination buffers for each segment are available when that segment is transferred; that the transfer of an entire 
segment will be along a single path, with packets of the segment delivered in order; that most errors will be detected 
and corrected at the link level; and that uncorrected errors will be infrequent enough to justify retransmission of 
segments as a correction mechanism. 

Segments serve an additional purpose as well: on lengthy transfers, we would like to distribute the traffic among 
disjoint routes from source to destination. The software on the originating node can fracture a message into multiple 
segments and transmit them along available routes to the destination in order to minimize overall message delay 
and hotspot congestion in the fabric. For very long messages, the software will enqueue later segments on the fly as 
earlier ones complete, to shift load to the fastest available path, and to avoid pinning too much memory at a time. 

A segment may consist of a large number of packets, and we don’t want to delay transfer of control information 
between nodes while waiting for completion of a segment, so segment transfers are treated as a background activity 
within the DMA engine; each output port generates packets for pending control transfers (foreground commands) 
in preference to segment transfers (background commands) on the same port. 
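
The foreground-before-background rule for an output port can be modeled as a strict two-level priority queue. The sketch below is a simplification with hypothetical names; it ignores buffer availability and everything else the real port arbitration must consider.

```python
from collections import deque


class OutputPort:
    """Toy model: control transfers always go ahead of segment transfers."""

    def __init__(self):
        self.foreground = deque()  # pending control transfers
        self.background = deque()  # pending segment packets

    def next_packet(self):
        """Pick the next packet to generate, or None if the port is idle."""
        if self.foreground:
            return self.foreground.popleft()
        if self.background:
            return self.background.popleft()
        return None
```

A control packet queued in the middle of a long segment therefore waits for at most the packet already in flight, not for the whole segment.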


4.4.1.7 Errors 


While we recognize that packets will occasionally be corrupted and/or lost in the fabric, we have designed the 
low-level communication hardware to detect and retry corrupted packets, preserving their order, so we expect that 
failures at higher levels will be very rare events, and the system is designed to assume that all packets following a 
common path between any pair of nodes will be delivered uncorrupted in the order they were transmitted. 

Note that the cut-through routing policy implies that a faulty packet may continue to propagate through the 
network, possibly even presented to a DMA engine for delivery at the incorrect destination. The switch is responsible 
for setting the type code of any corrupted packet to “poison”, and the DMA engine is responsible for discarding 
any poisoned packet it receives. 

The system is intended to make packet transfer sufficiently reliable that software can assume a transmitted 
packet will be delivered, and that foreground messages following a common path will be delivered in the order in 
which they were sent. Segment transfers can fail due to BDT faults at the source or destination nodes (indicating 
that a needed page has been swapped out); such faults are reported to software, which is expected to swap in the 
missing page and retry the transfer. 

Software bugs can also prevent received packets from being processed correctly. In these cases, the hardware 
notes the errors in passing, and discards the packet. 


4.4.1.8 Transmit 


Within this chapter, Transmit (abbreviated Tx) refers to the creation of packets and their injection
into the switch fabric, typically starting in the application as MPI_SEND. The transmit side of the engine is
therefore connected to the cache’s Read Data bus, which can cause confusion, because the engine receives cache
data in order to transmit it through the fabric.


4.4.1.9 Receive 


Similarly, Receive (abbreviated Rx) refers to the whole process of acceptance, processing, and storage
of packets coming from the fabric, starting in the application with MPI_RECV, and in the fabric with the arrival
of a new packet, even though the engine must transmit memory addresses and data to the cache to store a packet.
4.4.1.10 Multicast


The DMA engine can be directed to produce several output packets directed to processes on various other 
nodes in response to a received packet, so that a group of processes can quickly inform members of the group 
about collective results. Multicast selectively targets processes so as to reach members of a group quickly without 
disruption to other groups. 


4.4.1.11 Collective 


The engine also implements a decrement and test function which allows another command to be triggered when 
a number of messages have been received; this permits the hardware to collect inputs from several sources and 
transmit when they have all been received. 
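
A decrement-and-test counter of this kind can be modeled as follows. This is a sketch only: the class name and the callback used to represent "another command" are assumptions, not the engine's actual command format.

```python
class DecrementAndTest:
    """Trigger a follow-on action once the expected messages have arrived."""

    def __init__(self, expected, on_complete):
        self.remaining = expected       # number of messages still awaited
        self.on_complete = on_complete  # stands in for the triggered command

    def message_received(self):
        """Decrement the counter; fire the follow-on action when it hits zero."""
        self.remaining -= 1
        if self.remaining == 0:
            self.on_complete()
            return True
        return False
```

For a reduction over three children, the counter starts at 3 and the parent's outgoing message is triggered by the third arrival, with no software involvement in between.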


4.4.1.12 Copy 


The DMA engine is designed to support communication among application processes, whether they are on the 
same or different fabric nodes. To that end, the hardware supports local transfer of packets without use of the 
switch, but under the same protocol. 


4.4.2 High-level Hardware View 


The DMA Engine consists of a cluster of interacting state machines. The primary application interface consists 
of hardware-managed queues. One set of queues is used by the software to direct fabric activity, and another set is 
used by the engine to distribute incoming packets and completion events to the appropriate processes. The DMA 
engine is able to accept commands directly from any of the processors on the same chip, or indirectly from external 
processes through packets carried over the fabric. 

The DMA engine has virtually no interest in the contents of packets, aside from the Route and the Packet type, 
which specifies the queue or buffer into which the contents are stored. Packet contents are fetched from and stored 
to contiguous blocks of memory. 

All transfers are targeted to designated, pre-established destinations: either an event queue used by software,
the DMA command queues used by DMA engine hardware, a reserved region of memory called the heap, or a buffer
specifically allocated for the transfer.

And just as a clarification: the DMA engine is not involved in processing packets which pass through the switch 
on their way between other nodes — it provides the path into and out of the switch fabric, but packets on their way 
from one node to another do not involve the DMA engine on intervening nodes along the path. 


4.4.3 Canonical MPI Transfer Patterns 


MPI provides three basic message transfer forms: Send/Recv, as specified in MPI-1, depends on the active 
participation of application software at both ends of a transfer. One process Sends a message to another process, 
which must perform a Receive to get it. The rules for matching sender and receiver essentially require the matching 
to occur at the receiver. The operation does not depend on the relative time order in which send and receive 
occur. The other forms, specified in MPI-2, are called Get and Put, and are described as single-ended because 
each message transfer is entirely specified by one process (the Initiator). The correspondent (Responder) declares a 
window in memory, and other members of the communicating group are permitted arbitrary access to that window. 


4.4.3.1 Eager Transfer 


For short messages, whether single- or double-ended, our goal is to complete the transfer with a minimum
of overhead. Library software on the sending node queues a command to the local DMA engine for immediate
transmission of an Enq_Direct packet which identifies the communicator, sender’s rank, tag, and the data. Upon
arrival at the remote destination, the remote DMA engine pushes the packet payload onto the event queue of the
receiving process.

If the receiving process is waiting on a posted receive, the receiving process interprets the packet immediately. 
Otherwise, the packet is interpreted by a dedicated fabric processor, if there is one, or as a last resort, by a kernel- 
mode interrupt-level device driver. The receiving software is responsible for matching the communicator, rank, and 
tag of the packet with a posted receive, if there is one, and otherwise for storing the information to match against 
later receives as they’re posted. In eager transfers, the receiving software must copy the message contents to the
destination buffer. 
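
The matching duty of the receiving software can be sketched as two lists, one of posted receives and one of unexpected messages. This Python model is an illustration under simplifying assumptions: exact matching only (no MPI wildcards), hypothetical class and method names, and a bytearray standing in for the destination buffer.

```python
class Matcher:
    """Match eager packets to posted receives on (communicator, rank, tag)."""

    def __init__(self):
        self.posted = []      # (comm, rank, tag, buffer) waiting for a message
        self.unexpected = []  # (comm, rank, tag, data) waiting for a receive

    def on_packet(self, comm, rank, tag, data):
        """Called when an eager packet is taken off the event queue."""
        for i, (c, r, t, buf) in enumerate(self.posted):
            if (c, r, t) == (comm, rank, tag):
                del self.posted[i]
                buf[:len(data)] = data  # eager transfers require a copy
                return True
        self.unexpected.append((comm, rank, tag, data))
        return False

    def post_receive(self, comm, rank, tag, buf):
        """Called when the application posts a receive."""
        for i, (c, r, t, data) in enumerate(self.unexpected):
            if (c, r, t) == (comm, rank, tag):
                del self.unexpected[i]
                buf[:len(data)] = data
                return True
        self.posted.append((comm, rank, tag, buf))
        return False
```

Either call can complete the transfer, which reflects the statement that the operation does not depend on the relative order of send and receive.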


[Figure: eager transfer. Initiating Node (Tx Que: send short message), Switch Fabric (Enq_Direct packet(s)),
Responding Node (Event Que: receive).]


For intermediate-sized messages (too large for a single packet), software may choose to use Put_Im_Hp commands 
to copy from a buffer in the source application to the heap of the destination process, prior to notifying the receiver 
of message availability through an Enq_Direct. 


4.4.3.2 Single-ended Messages 


Once both ends of the communications link have set up buffer descriptors to describe the communications buffers,
one-sided messages, get or put, may be used to move the data. If the sender initiates the transfer, a Put_Bf_Bf
command is used. If the receiver initiates the transfer, a Send_Command containing a Put_Bf_Bf command is sent
to the transmit-end DMA Engine.


Put_Bf_Bf waits in a transmit queue for access to the output port required by its route. When it reaches the
head of the queue, it generates a sequence of DMA packets. When the DMA packets arrive at the receiver, the DMA 
Engine there places their contents in memory at the specified address. A special DMA_END packet terminates 
the transfer, at which point the receiving DMA Engine can execute a string of commands to signal software of 
completion, or store a fault event to signal failure. 


4.4.3.3 Rendezvous Exchange


The sequence for Send/Recv transfer of a long message consists of an initial handshake called a rendezvous, 
in which the nodes agree that both are ready for the transfer to take place, with appropriate buffers available in 
memory and hardware resources for controlling the transfer. 


The rendezvous exchange consists of a single Enq_Direct packet from the sender to the receiver in which the 
sender notifies the receiver of the existence of the message; its communicator, rank, and tag; and the BDT handles 
describing its buffer. When the receiver finds a matching receive, it performs the equivalent of a single-ended Get 
to transfer the message, except that the sender’s DMA engine reports a completion event to the sender. 


The rendezvous provides sufficient information for the sender and receiver to agree on the alignment of payload
data within packets; the receiver acknowledges successful, error-free receipt of message segments, or requests
retransmission of the segment in the event of a timeout or uncorrectable error.


For very long transmissions, the endpoints may agree to transfer several segments concurrently along disjoint 
paths, distributing the traffic around any hotspots. 


The rendezvous exchange enables very efficient use of hardware, compared to a software-mediated (eager) 
transfer, but requires an additional trip to set up. 




[Figure: rendezvous transfer between the Initiating Node and the Responding Node across the Switch Fabric. Labels: 
Send long message; Rendezvous Request (Enq_Direct packet); Receive; Match Send & Recv; Rendezvous Response; 
DMA Packets; Ack (optional, Enq_Direct packet); Tx Que, Rx Que, and Event Que on each node.] 


Rendezvous transfer described. To transfer a long message using MPI_SEND/MPI_RECV, the sequence resembles 
the following: 

• The sending application process calls MPI_SEND. 

• The sending MPI library decides that the message length is great enough to justify the rendezvous protocol (a 
compile-time parameter). 

• The sending MPI library builds a Send_Event command which describes the communicator, sending rank, 
and tag of the message, along with a buffer handle and offset for the user’s message buffer. This information is 
collectively called a rendezvous request. The library code pokes the DMA engine to tell it there’s a command 
on the command queue. 

• The DMA engine pops the Send_Event command from the process command queue, and translates its route 
handle to determine which output port should be used to reach the receiving node. If the foreground context 
for that port is available, the command is enabled for immediate output; otherwise it is copied to the 
port-specific transmit foreground queue for transmission as available. 

• The Send_Event command results in delivery of an Enq_Direct packet to the receiving node, where it is 
matched to the target DMA context and stored on the event queue of that context. 

• At some time either before, during, or after all the above, the receiving application process calls MPI_RECV. 

• The receiving MPI library searches the lists of previously-unmatched Sends. If there is one whose 
communicator, rank, and tag match the parameters of the current receive, the match is made, and the receiver 
initiates a Get_Seg sequence, described below. If there is no match, the parameters of the current receive 
request are stored to be matched against future sends. 

• The receiving MPI library processes the event queue. If it finds a send (either rendezvous or eager), the 
library searches the lists of posted receives to find a match. 

• Once a match has been made, the library software at the receiver builds a Put_Bf_Bf command, which 
consists of two parts: the information needed by the receiver context to accept DMA packets for the transfer; 
and information needed by the sending node to build those DMA packets. Library software enqueues the 
Put_Bf_Bf command inside a Send_Cmd command. 

• The receiver’s DMA engine sends an Enq_Response packet to the sender, carrying the Put_Bf_Bf command 
to be executed to perform the transfer. 

• When the Enq_Response arrives at the transmitter, it is enqueued to be performed when it reaches the head of 
the background queue for the appropriate port. 

• The sending node generates DMA packets as rapidly as the switch fabric can accept them, and the receiving 
node stores them in the destination buffer according to the receive context. 

• Upon successful completion of the transfer, the receiver performs an optional command string. 
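The matching steps in the sequence above can be illustrated in software. The following is a minimal Python sketch, not the library's actual implementation: the class name `Matcher`, its method names, and the returned tuple are hypothetical, and MPI wildcard matching is left out. It only shows how a send announcement and a receive call find each other by communicator, rank, and tag regardless of which arrives first.

```python
from collections import deque


class Matcher:
    """Illustrative model of send/receive matching (names are assumptions)."""

    def __init__(self):
        self.unmatched_sends = deque()  # (key, sender buffer handle)
        self.posted_recvs = deque()     # (key, receiver user buffer)

    def on_send_event(self, comm, rank, tag, handle):
        """A rendezvous request was found on the event queue."""
        key = (comm, rank, tag)
        for i, (k, user_buf) in enumerate(self.posted_recvs):
            if k == key:
                del self.posted_recvs[i]
                return ("get", handle, user_buf)   # start the transfer
        self.unmatched_sends.append((key, handle))  # wait for a receive
        return None

    def on_recv_call(self, comm, rank, tag, user_buf):
        """The application posted a receive."""
        key = (comm, rank, tag)
        for i, (k, handle) in enumerate(self.unmatched_sends):
            if k == key:
                del self.unmatched_sends[i]
                return ("get", handle, user_buf)   # start the transfer
        self.posted_recvs.append((key, user_buf))   # wait for a send
        return None
```

Both lists are searched oldest-first, so two sends with the same communicator, rank, and tag are matched in arrival order.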


4.5 Queues 


The software interface to the DMA Engine consists of a page of control registers which are used by the kernel’s 
device driver for configuration setup and diagnostic purposes, plus a set of control pages through which the library 
software requests activities by the engine, and through which the engine reports completion of requests and arrival 
of new messages. The hardware supports concurrent interaction with 14 DMA contexts, so there are 14 separate 
control pages as described in Table ?? below. 

The control pages provide a multiport interface to a hardware queue manager which schedules the activities of 
the fabric input and output links. It accepts commands from fourteen DMA contexts and responses from three 
input links, distributing them into separate queues for each of the output links. It also provides a queue bypass 
function which avoids memory writes and reads in the (common) event that the target port is idle. 

The memory area allocated for queues should be large enough to make queue overflow very unlikely, but the 
hardware will discard any received packet destined for a queue which doesn’t have room for it. It is up to software 
to ensure that queues do not overflow; we expect that quotas will be used to ensure that there is space for every 
queue entry. Each process is allocated a quota which determines the maximum number of commands it may have 
in the port queues at any time; any commands in excess of that limit remain in the process command queue. 

For simplicity of software (but not minimal memory use) all queue entries are 128 bytes, a multiple of the L2 
cache block size, and are allocated aligned to cache blocks. This avoids issues of false sharing between entries. 
Software writes queue entries on the command queue by writing the entry in main memory. The hardware is 
informed of the update by a write to a special I/O register. Hardware then reads the command block to see which 
output port it needs. (See Figure 5.6) 

The block is copied from the command queue, where it was written by software, to the port if idle, or to the 
selected port queue. 

Port threads are responsible for pulling commands off the port queues as earlier commands complete. A 
specialized thread, called the queue manager, accepts commands as they are written by software, sorting them into 
the appropriate port queues or inserting them directly into available slots for use by transmit threads. 

Each queue is described by a set of three values accessible to the kernel: 


1. The memory region used for a queue is described by a buffer descriptor (see paragraph 4.7.4) with the physical 
address in bits 35:0, and the negative length of the region in bits 63:36. 


2. The read pointer is the physical address of the next item to be removed from the queue (the head of the 
queue). If the queue is empty, the read pointer matches the write pointer. 


3. The write pointer is the physical address at which the next item should be inserted in the queue (tail). 


Both read and write pointers advance in steps of 128. When a pointer reaches the end of the memory region, 
it wraps back to the beginning of the region before the next entry is read or written. The region descriptor 
length should therefore be a multiple of 128. 
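The pointer arithmetic for such a ring can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the microcode: the function names are hypothetical, and the region is given as a base address and a positive length rather than the negative-length encoding used in the descriptor.

```python
ENTRY_SIZE = 128  # every queue entry occupies 128 bytes


def advance(ptr, base, length):
    """Return the pointer that follows ptr in a ring of `length` bytes at `base`."""
    nxt = ptr + ENTRY_SIZE
    if nxt >= base + length:   # reached the end of the region: wrap
        nxt = base
    return nxt


def is_empty(read_ptr, write_ptr):
    """The queue is empty when the read pointer matches the write pointer."""
    return read_ptr == write_ptr


def entries_pending(read_ptr, write_ptr, length):
    """Number of entries between the head (read) and the tail (write)."""
    return ((write_ptr - read_ptr) % length) // ENTRY_SIZE
```

Because an empty queue is signalled by equal pointers, a ring of N entries can hold at most N - 1 pending entries in this model.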


4.5.1 Command and Port queues 


The command queue is the mechanism by which applications software directs operation of the DMA Engine. To 
send a message, the software writes one or more commands, indicating the location of the data to be used (by buffer 
descriptor index, offset, and length), the destination (by route handle), and linkage to appropriate completion notice. 
Software notifies the DMA Engine of an addition to the command queue, using an I/O write to fastCmdHdr in 
DmaAppIface0 or cmdQWrSize in DmaAppIface1, and the DMA Engine either executes the command immediately 
or transfers the entry to the appropriate port queue. For single-packet message transmission, the command queue 
item typically contains the entire packet payload; microcode translates the route handle to obtain the routing 
information, assembles a packet, and appends a check code before injecting the packet into the fabric. 

Software can directly add to the command queue on the local node. Those commands include the ability to 
enqueue commands at remote nodes as if they had been initiated by software on that node. This feature is used 
for single-ended operations and broadcast, among other purposes. 
Figure 4.1: Command and Event Queues 

Each process has one command queue which provides access to all the port queues. A transmit command 
may use the foreground queue for short control messages; a foreground command takes priority on the output 
port it needs, and gets sent as quickly as possible, but in order with respect to other foreground commands on 
the same port. The DMA engine thread for each output port services the transmit queues for that port on a 
foreground/background basis, servicing foreground transmits in preference to background. 
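The foreground/background service policy for one output port can be modelled as two FIFO queues with strict priority. The Python sketch below is an illustration only; the class and method names are hypothetical and it ignores everything about a command except its queue.

```python
from collections import deque


class PortQueues:
    """Transmit queues for one output port (illustrative model only)."""

    def __init__(self):
        self.foreground = deque()  # short control messages
        self.background = deque()  # bulk transfers

    def enqueue(self, command, foreground):
        (self.foreground if foreground else self.background).append(command)

    def next_command(self):
        """Return the next command to transmit, or None if both queues are empty."""
        if self.foreground:            # foreground always wins
            return self.foreground.popleft()
        if self.background:
            return self.background.popleft()
        return None
```

Foreground commands stay in order with respect to each other, but a background command queued earlier can be overtaken by any later foreground command on the same port.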

Interface software must exercise care not to overrun pending commands on the command queue, and because 
commands for different ports may be serviced out of order, neither cmdQRdPtr nor cmdQRdOffset is a reliable 
indication of where the oldest pending command is. Software should use Supervise commands to determine how 
much of the command queue region is free. 


4.5.1.1 Process quota 


The port queues are shared among all processes on a node, so it is important to prevent bugs in one process 
from interfering with another; in particular, we must prevent overflow or saturation of the port queues by one 
process from damaging another. Therefore, each process is given a quota representing the maximum number of 
commands it may have in the port queues at any time, and the port queue regions must be sized to permit the full 
quota allocated to all processes in each of the port queues. 

The DMA engine suspends processing of the command queue of any process which has reached its quota of 
commands in the port queues, and commands received from remote nodes for such a process are enqueued on the 
event queue rather than the port queue. Library software is expected to copy such deferred commands to the 
command queue, keeping them in order. The DMA engine maintains a count of the number of commands deferred 
in this way, and continues deferring remote commands to the event queue until all deferred commands have been 
enqueued to the port queues. 
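The quota rule can be made concrete with a small Python model. It is a sketch under assumptions: the class and method names are hypothetical, and the `resubmitted` flag is a stand-in for whatever mechanism the engine uses to recognise a deferred command that the library has copied back to the command queue, which the text does not describe.

```python
class QuotaModel:
    """Per-process port queue quota (illustrative model only)."""

    def __init__(self, quota):
        self.quota = quota
        self.in_port_queues = 0  # commands currently held in port queues
        self.deferred = 0        # remote commands diverted to the event queue

    def accept_local(self, resubmitted=False):
        """Try to move one command from the command queue to a port queue."""
        if self.in_port_queues >= self.quota:
            return False         # command queue processing is suspended
        self.in_port_queues += 1
        if resubmitted and self.deferred > 0:
            self.deferred -= 1   # one deferred command has reached a port queue
        return True

    def accept_remote(self):
        """Place a command received from a remote node."""
        if self.deferred > 0 or self.in_port_queues >= self.quota:
            self.deferred += 1   # keep deferring until the backlog drains
            return "event queue"
        self.in_port_queues += 1
        return "port queue"

    def complete(self):
        """A command left the port queues."""
        self.in_port_queues -= 1
```

Continuing to defer while `deferred` is non-zero is what keeps remote commands in order: a new arrival may not overtake older ones still waiting on the event queue.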


4.5.1.2 Command order 


The DMA Engine provides a limited set of assurances about the order of command processing: 


• Commands from a single process, sent out a single transmit port, will be sent in the order in which they are 
queued, except that background commands (Put_Bf_Bf) may be delayed with respect to newer foreground 
commands (any others). 

• Foreground commands in a string invoked by Do_Cmd or a receive completion and directed to a single transmit 
port will be performed in order, but not necessarily ordered with respect to the command queue. 

• Commands for multiple contexts or directed out different transmit ports are not ordered. 


Combined with the assurances by the fabric of reliable, in-order delivery of packets following the same route and 
virtual channels, these conditions are sufficient for the software to ensure consistent ordering of messages where 
necessary. 


4.5.2 Event queue 


The event queue is the mechanism by which the DMA engine notifies software about completion of commands or 
errors which prevent completion, and also one of the mechanisms by which software on one node can communicate 
with another. Software can select whether the queueing of events raises an interrupt request (see paragraph 4.7.7). 
Typically, an entry on the event queue indicates that the transfer described by a transmit or receive context is 
complete, or that a remote process has sent a short message directly to this one. 


4.5.2.1 Hardware-generated events 


• Buffer descriptor invalid 

• Unmatched Process ID 

• Heap/BDT/RDT index out of bounds 

• Diversion for port queue quota 

• Segment completion at transmitter/at receiver 
In general, fault events are delivered to the event queue which belongs to the local process which encountered 
the fault. When a Put_Bf_Bf command encounters a tx buffer descriptor fault, the transmitting node sends an 
Enq_Direct packet whose payload is stored at the receiver as a SegAbort event. 


The first word of an event queue entry contains the event type in bits 11:8: 

Enum: DmaEventType 
Attributes: -allowlc -kernel 

Value   Name          Definition 
4'd1    (illegible)   A heap handle exceeds the heap size. The bad heap handle is stored in d1[31:0]. 
4'd2    (illegible)   A route handle exceeds the RDT size. The bad route handle is stored in d1[31:0]. 
4'd3    bdtFault      At the receiver, a buffer handle exceeds the BDT size, a buffer descriptor length is 
                      too short for the offset requested, or a buffer descriptor is marked read-only. The 
                      swBucket is stored in d1[63:0]. 
4'd4    cmdFault      An illegal command code was encountered. Either the command is undefined, or it was 
                      inappropriate to be issued as a fastCmdHdr. The command header is stored in d1[63:0]. 
4'd5    segAbort      Reported at the receiver when the transmitter aborted the segment. The swBucket is 
                      stored in d1[63:0]. 
4'd6    pidMismatch   A received packet contained the wrong process id for its selected process index. The 
                      packet header and trailer are stored in d1 and d2 on the process 0 event queue. 
4'd7    queueFault    Software error setting up command or event queue pointers. This event is stored on 
                      the process 0 event queue. The process index of the failing process is stored in d1. 
4'd8    deferredCmd   Process received more commands from remote nodes than allowed by the port quota; any 
                      excess are stored on the process event queue. This event queue entry contains, in d1 
                      through d14, the payload (a nested command) of an Enq_Response packet which could not 
                      be pushed onto the port queue. 
4'd9    rxEndSeg      Successful end of segment at receiver. d1 contains swBucket. 
4'd10   portFault     The txPort hint in a command header differs from the port specified by the route 
                      descriptor in the RDT. The command header is stored in d1. 

Event queue entry 
Class: DmaEventQueue 

Bits        Field         Description 
d0[7:0]     eventLength   The “useful length” of the event queue entry, in bytes 
d0[11:8]    (illegible)   Event type (DmaEventType) 
d0[63:12]   reserved 
d1[63:0]    eventData     Information specific to the event type, as described above 

Event queue entries are written 128 bytes apart, to keep the pointer management as simple as possible. The 
event length field indicates the number of bytes of the entry which were actually written by microcode. 
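Decoding the first word of an event queue entry is a matter of two bit-field extractions. The Python sketch below is illustrative: the function name and the returned tuple are assumptions, and the name table lists only the codes whose value and name are both legible in the enum above.

```python
# Event codes whose value and name are both given in the DmaEventType enum.
EVENT_NAMES = {
    3: "bdtFault",
    4: "cmdFault",
    5: "segAbort",
    6: "pidMismatch",
    7: "queueFault",
    8: "deferredCmd",
    10: "portFault",
}


def decode_event_word(d0):
    """Split word d0 of an event queue entry into (length, code, name)."""
    event_length = d0 & 0xFF        # bits 7:0, useful length in bytes
    event_code = (d0 >> 8) & 0xF    # bits 11:8, DmaEventType
    return event_length, event_code, EVENT_NAMES.get(event_code, "unknown")
```

A consumer would read d0 first and then use the length to decide how many of the following words (d1 onward) carry valid data.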


4.5.3 Summary of DMA Engine Queues 
To wrap up the section on queues, Table 4.1 is a list of all the types of queues that DMA engine interacts with. 


Below the table are some notes on the commands or events which are found in each queue. [from Bryce: When 
commands and events are more completely defined elsewhere, some of this should be removed.] 
Table 4.1: DMA Engine Queues 

Queue                            Number 
command queue                    one per DMA Context 
event queue                      one per DMA Context 
transmit foreground port queue   four: one per TX port, plus copy 
transmit background port queue   four: one per TX port, plus copy 


Commands can contain: 

• command type 

• data to set up transmission 

• raw packet data (can contain nested commands for remote DMA) 

Events can contain: 

• event type 

• info about a transfer that completed or failed 

• info about an unsolicited packet that arrived 

• raw packet data 


4.6 Modes of Operation 


The DMA Engine hardware needs attention from a programmable processor at the beginning and end, and 
occasionally in the midst, of a message transmission. Under various circumstances, the processor selected to do the 
work might be the one running the application process, one dedicated as a fabric support processor, or an interrupt 
service routine in a designated processor. We distinguish these cases as modes because the literature refers to 
heater mode, communication processor mode, etc., to describe similar configurations, but unlike other cases in the 
literature, our system switches among the modes freely for optimal performance. 


4.6.1 Synchronous mode 


The conceptually simplest form of communication between MPI processes is synchronous mode, in which the 
sender creates and sends a message, waiting to proceed until it has been received, and the receiver declares an 
available buffer for the message, waiting until it has been filled. 

In synchronous mode, the processors used by the communicating processes are essentially idle while the 
communication is going on, and are therefore the ideal candidates to perform any support and supervisory work 
required by the DMA hardware. In the current vision, that includes on the transmit side: maintenance of data 
structures, confirmation of error-free transmission, and timeout monitoring. On the receive side, it includes 
selection and scheduling of message segments; communicator, rank, and tag matching (CRT match); management of 
unexpected message buffers; maintenance of data structures; and timeout monitoring. 


4.6.2 Asynchronous mode 


Synchronous mode MPI communication allows no overlap between computation and communication, so MPI 
also provides asynchronous versions of both Send and Receive to permit the programmer to initiate one or several 
message transfers, conduct independent calculations, and then wait for completion of some or all of the transfers. 
Those portions of the transfer which take place when the application has finished its calculation and is waiting are 
treated as synchronous, in spite of having been initiated with the asynchronous calls; but for the remainder, we 
don’t want to slow down the application by interrupting it to service the message. 

Therefore the preferred mechanism for dealing with asynchronous message service is to designate one processor as 
the “fabric processor”, and run it in a spin loop monitoring the input/event queues for all the others, and servicing 
traffic for each as it comes in. 
4.6.3 Interrupt mode 


Of course, there are times when there’s nothing to do but compute, and lots of it. During those times, we would 
hate to have 1/6 of our compute capability tied up as a fabric processor, so we will return the fabric processor to 
the scheduling pool and handle any rare requirements for DMA Engine service as interrupt requests directed to a 
designated processor. 


4.6.4 Fabric Processor 


During those times that the system dedicates one processor on a node as the fabric processor, it will run a 
process which has mapped the heap, event queue, and buffer descriptor table of each application process into its 
own address space. The fabric processor and application processor interlock access to the event queue by means of 
shared variables in the heap to ensure that exactly one of them services every event. 


4.6.5 Virtualized mode 


It would be a desirable feature if the software were able to multiplex the limited hardware resources among 
a larger number of processes, so that descheduled processes (in the Unix sense) were still available to participate 
asynchronously in MPI communications. We have had some preliminary discussions about this possibility, but have 
not resolved all the protection issues involved. Two models have been discussed: 


• Multiplexed applications are linked with a different library, which calls a daemon or kernel service to 
communicate, in a manner similar to MPI over TCP. 

• Multiplexed applications timeshare a DMA Context for command and event queues, but external traffic is 
actually directed to a kernel-mode driver which demultiplexes to the appropriate address space. [How to 
handle remote commands?] 


At the moment, this feature is mostly a pipe dream, but if we can devise a reasonable implementation, it would be 
desirable. 

The simplest implementation of virtualization is provided by the current specification: to share the hardware 
resources, the operating system kernel stops all the processes of a job, waits for the job’s current traffic to quiesce, 
and reassigns the hardware resources to the processes of a new job. 


4.7 Communication state 


Communicating processes may have a very large number of simultaneously-outstanding message requests; it 
is up to the MPI library or equivalent software to schedule message activity, and provide the DMA Engine with 
descriptive information about each active message. 

In the descriptions which follow, unused or unspecified fields in commands and registers should be initialized as 
zero. 


4.7.1 Transmit state 


The DMA Engine maintains for each output port some transmit (Tx) state in a hardware structure which 
describes an outgoing segment during its transfer: a sequence of packets, the buffer from which they are read, and 
their destination, which typically consists of a route to a node and a receive context id on that node. (Table ??) It 
is loaded by the transmit thread, which assembles the various components from the command, the buffer descriptor 
table, and the route descriptor table. When the transmit state has been loaded, the transmit thread is able to 
create packets and inject them into the fabric. When a complete segment has been transmitted, a new command 
is popped off the transmit queue. 

When a transmit command is executed by the DMA engine, the Route Handle is used to look up a route in the 
kernel-controlled route table, and the Buffer Handle is used to obtain the base address and length of a physically- 
contiguous region of the buffer. That region may not be as large as the message segment; if it runs out before the 
end of the segment, the DMA engine hardware increments the Buffer Handle to obtain a new BDT entry in which 
to continue the segment. The engine also clears the offset, so that subsequent packets will come from the beginning 
of the next region of buffer. 
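The way a transmit segment spills from one BDT entry into the next can be shown with a short Python sketch. It is an illustration under assumptions: the function name is hypothetical, and the BDT is modelled as a plain list of (base address, length) pairs with positive lengths instead of the hardware entry format.

```python
def segment_regions(bdt, handle, offset, seg_len):
    """List the (physical address, byte count) pieces that make up a segment.

    bdt     -- list of (base_address, length) pairs, indexed by buffer handle
    handle  -- buffer handle of the first region
    offset  -- byte offset of the segment start within that region
    seg_len -- total segment length in bytes
    """
    regions = []
    while seg_len > 0:
        base, length = bdt[handle]
        if offset >= length:
            raise ValueError("offset lies beyond the end of the buffer region")
        n = min(length - offset, seg_len)  # bytes available in this region
        regions.append((base + offset, n))
        seg_len -= n
        handle += 1   # continue in the next BDT entry ...
        offset = 0    # ... starting at its beginning
    return regions
```

For example, a segment of 0x2000 bytes that starts 0x1000 bytes before the end of a 64 KByte region is split into two pieces of 0x1000 bytes each, the second taken from the start of the following region.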
The DMA engine needs storage for 8 separate transmissions in hardware: foreground (bypass) and background 
(bulk) contexts for each output port plus the copy thread. The transmit queues of waiting commands are kept in 
memory queues associated with each port. The port-specific queues are written by the queue manager and read by 
the port threads as hardware space becomes available. 


4.7.2 Receive state 


Every DMA packet carries a 64 bit control word, which contains a buffer handle (2 bytes), a buffer offset (4 
bytes) and a notifier (2 bytes). To work efficiently, the microcode implements a buffer descriptor cache with lookup 
faster than loading the BD from memory for each packet. This design makes it impossible to carry from one Rx 
buffer handle to the next in the middle of a segment. Software will arrange that DMA packets are full cache-line 
aligned at the receiver, and segments do not cross page boundaries at the receiver, so this won’t be needed. 
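The three fields of the control word add up to exactly 64 bits (2 + 4 + 2 bytes). The text does not give their bit positions, so the Python sketch below assumes one possible layout, with the buffer handle in the top 16 bits, the offset in the middle 32, and the notifier in the low 16. The layout and the function names are assumptions made for illustration.

```python
def pack_control_word(buffer_handle, buffer_offset, notifier):
    """Pack the three fields into one 64-bit word (assumed field order)."""
    if not 0 <= buffer_handle < 1 << 16:
        raise ValueError("buffer handle must fit in 2 bytes")
    if not 0 <= buffer_offset < 1 << 32:
        raise ValueError("buffer offset must fit in 4 bytes")
    if not 0 <= notifier < 1 << 16:
        raise ValueError("notifier must fit in 2 bytes")
    return (buffer_handle << 48) | (buffer_offset << 16) | notifier


def unpack_control_word(word):
    """Return (buffer_handle, buffer_offset, notifier)."""
    return (word >> 48) & 0xFFFF, (word >> 16) & 0xFFFFFFFF, word & 0xFFFF
```

Whatever the real field order is, the point stands that every DMA packet is self-describing: the receiver needs no per-segment state beyond the descriptor cache to place the payload.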

Segment transfers can fail because of BDT faults at transmitter or receiver. An attempt to access an invalid 
buffer descriptor or to write beyond the end of the buffer descriptor will be detected at the receiver. The receiver 
will set a bit in the heap selected by the notifier, and discard the packet. At the end of the segment, the transmit 
microcode will send a DMA_END packet, which causes the receiver to test the heap for an earlier error. If the 
transmit end faults due to a bad buffer descriptor, an ENQ_DIRECT packet with a Seg_Abort event will be sent 
to the receiver. 


4.7.3 Notifiers 


DMA commands include a 16-bit field, called the notifier, which is used by software to uniquely identify a 
segment transfer. In the event of a BDT failure at the receiver, the rxNotifier is used to remember which segment 
failed, and upon completion of the transfer, an entry is created on the local event queue, including the notifier of 
the failing DMA command and the BDT index responsible for the failure. 


4.7.4 Buffer descriptor 


A buffer descriptor translates a process virtual address range to a contiguous physical address range. Buffer 
descriptors are used to describe message buffers, get/put windows, and queue rings. Contiguous groups of entries 
are used to describe contiguous regions of virtual address space which may be discontiguous in physical memory. 

More particularly, each DMA Context has a register representing the starting physical address and length of the 
buffer descriptor table for that context (see Table ??). The Buffer Descriptor Table (BDT) contains 8-byte entries, 
each of which contains the starting physical address of a buffer and the length of the buffer in bytes. A Buffer 
Handle, which appears in DMA command queue entries, is a 16-bit unsigned integer less than the BDT size; the 
hardware multiplies it by 8 and adds it to the bdtRegion pointer to identify a specific BDT entry. 
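The handle-to-address calculation is small enough to write out. In this Python sketch the function name and the use of `IndexError` are assumptions; in hardware an out-of-range handle is reported as a fault event rather than an exception.

```python
BDT_ENTRY_SIZE = 8  # bytes per Buffer Descriptor Table entry


def bdt_entry_address(bdt_region, bdt_entries, buffer_handle):
    """Physical address of the BDT entry selected by a buffer handle.

    bdt_region  -- physical address of the start of the BDT
    bdt_entries -- number of entries in the table (the BDT size)
    """
    if not 0 <= buffer_handle < bdt_entries:
        raise IndexError("buffer handle exceeds the BDT size")
    return bdt_region + buffer_handle * BDT_ENTRY_SIZE
```

Since the handle is an index rather than an address, user software can never name memory outside the table the kernel set up for its context.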

A single BDT entry describes a region of memory which is contiguous in both virtual and physical address 
spaces; it is not necessarily restricted to a single page, though of course such a restriction is sufficient to ensure 
contiguity. 

Each BDT entry is valid if its length field is negative. The DMA engine will abort transmission of a sequence 
which uses a BDT entry in which the length field is positive or zero. The engine will generate an event queue entry 
for the local DMA Context to indicate the BDT entry fault, and will not perform any command string associated 
with the successful completion of the command. 

On the transmit side, a segment is permitted to wrap off the end of a buffer descriptor and into the next; this 
is not allowed on the receive side. 

The physical address specified by a buffer descriptor must be aligned to a 64-byte boundary (low 6 bits zero). 

Bit 0 of a BDT entry may be set to 1 to indicate that the buffer is read-only; use of such an entry for a receive 
buffer will cause a bdtFault. 


Buffer Descriptor 
Class: DmaBufferDesc 
Attributes: -kernel 

Bits        Field         Description 
d0[35:0]    physAddress   Physical address of start of buffer (address must be 64-byte aligned) 
d0[63:36]   len           Length of physically-contiguous region 
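The entry format can be exercised with a short encode/decode pair in Python. This is a sketch under assumptions: the function names and the dictionary keys are hypothetical, and the length field is treated as a 28-bit two's-complement value holding the negated byte count, following the statements that the region length is stored negative and that an entry is valid only if its length field is negative.

```python
ADDR_MASK = (1 << 36) - 1   # d0[35:0]
LEN_BITS = 28               # d0[63:36]


def encode_bdt_entry(phys_addr, length, read_only=False):
    """Build a 64-bit BDT entry for a buffer of `length` bytes."""
    if phys_addr & 0x3F or not 0 <= phys_addr <= ADDR_MASK:
        raise ValueError("address must be 64-byte aligned and fit in 36 bits")
    if not 0 < length <= 1 << (LEN_BITS - 1):
        raise ValueError("length out of range for the length field")
    neg_len = (-length) & ((1 << LEN_BITS) - 1)   # store the negated length
    return (neg_len << 36) | phys_addr | int(read_only)  # bit 0 = read-only


def decode_bdt_entry(entry):
    """Split a 64-bit BDT entry into its fields."""
    raw = (entry >> 36) & ((1 << LEN_BITS) - 1)
    signed = raw - (1 << LEN_BITS) if raw & (1 << (LEN_BITS - 1)) else raw
    valid = signed < 0                            # valid only if negative
    return {
        "phys_addr": entry & ADDR_MASK & ~0x3F,   # mask off the flag bits
        "length": -signed if valid else 0,
        "read_only": bool(entry & 1),
        "valid": valid,
    }
```

One consequence of the negative-length convention is that an all-zero entry is automatically invalid, so a freshly zeroed table describes no buffers.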
Figure 4.2: Buffer Addressing 

[Figure labels: DMA Engine; DMem Registers; Process Index; BDT Region; Buffer Descriptor Table; Buffer 
Descriptor; handle; offset; segment length; Buffer; packet payload.] 


Bits 5:1 of physAddress must be zero to ensure 64-byte alignment of data references. Bit 0 is not interpreted as 
an address bit, but if set restricts the buffer to read access only (DMA transfers cannot write; bdtFault is reported 
instead). 

It is the current plan to make all BDT entries describe one page of memory: 64 KBytes. This is large enough 
for efficient segment transfer and makes the VM management problem much easier. 


4.7.4.1 Virtual Memory swapping 


The user processes which depend on the DMA engine for communication services are ordinary Linux processes. 
As such, some pages of their virtual address space may not be in memory. Many interprocessor communication 
systems deal with this problem by requiring pages with active buffers to be “pinned”, so that they cannot be paged 
out. This requires an explicit system call to pin and un-pin the buffers, or constrains the program to fit in the 
available memory and keep the entire data space pinned. We have chosen instead to assume that active buffers are 
in physical memory, and provide an escape mechanism for the rare cases in which that fails. 

When the kernel in any SMP decides to swap out a page, it has to ensure that all processors have invalidated 
the page entry in their TLBs; in our system, it must also invalidate any corresponding entries in the BDT, and 
invalidate the BD cache in the hardware. 


4.7.5 Route descriptor 


The Route Descriptor Table (RDT) contains routing directives to get from this node to a specific Unix process 
on another node, typically by three disjoint paths. Route descriptors are protected from modification by the user; 
they are accessed by handles like buffer descriptors. A Route Handle, which appears in command queue entries, is 
a 28-bit unsigned integer less than the RDT size; it identifies a specific RDT entry as an offset relative to the RDT 
region. 

Each process has a register representing the starting physical address and length of the route descriptor table 
(RDT) for that process (see Table ??). It is a software decision whether RDTs are shared among processes. Each 
RDT entry is 8 bytes: 32 bits of routing directives, 4 bits of starting virtual channel number, a 16-bit process id 
on the destination node, and a 4-bit index which identifies the hardware process associated with the destination 
process id. The Route Descriptor also contains a 2-bit field identifying the output port associated with a path, so 
that a command using it can be stored on the appropriate transmit port queue. 

All packets are given a path to their destination node and process at the time a command is enqueued in the 
source node’s DMA Engine. The path is described by a string of routing directives, one per switch, indicating the 
output port to use on that switch. After selecting the output, each switch shifts the routing directive right two 
bits, discarding one directive and exposing the next for use at the next switch. Upon arrival at the destination 
node, the process id in the packet is compared against that of the context selected by process index to determine 
the context in which the packet should be treated. 
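The shift-per-hop routing scheme can be traced with a few lines of Python. The sketch assumes that the directive for the current switch sits in the low two bits of the routing string, which follows from the shift-right behaviour but is not stated outright; the function names are hypothetical.

```python
def build_path(ports):
    """Pack a list of per-switch output ports (0..3) into a routing string."""
    if len(ports) > 16:
        raise ValueError("a 32-bit routing string holds at most 16 directives")
    path = 0
    for hop, port in enumerate(ports):
        if not 0 <= port <= 3:
            raise ValueError("each directive is two bits")
        path |= port << (2 * hop)   # first hop ends up in the low bits
    return path


def follow_path(path, hops):
    """Return the output port chosen at each switch and the leftover string."""
    chosen = []
    for _ in range(hops):
        chosen.append(path & 0b11)  # this switch's directive
        path >>= 2                  # discard it, exposing the next one
    return chosen, path
```

With two bits per switch, the 32-bit routing field of a route descriptor can describe a path of up to 16 hops.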


Route Descriptor 
Class: RouteDescriptor 
Attributes: -kernel 

Bits          Field          Description 
(illegible)   Port           Output port used for this path 
(illegible)   virtChan       Initial virtual channel 
d0[15:12]     processIndex   Remote process index 
d0[31:16]     processID      Remote process id 
d0[63:32]     path           Routing string for switch fabric - shift right each hop 


Descriptor Cache. The DMA engine caches up to 128 route descriptors and buffer descriptors. Any time that 
software modifies the RDT or BDT, it must write the corresponding handle to routeHdlPrefetch or 
bufferHdlPrefetch, respectively, in the DmaAppIface1 for the corresponding context to keep the cache coherent 
with the table in memory. 


Broadcast. We considered creating a broadcast mechanism in the switch, so that a broadcast packet received on 
any input port would be replicated on all the outputs, until a time-to-live counter expired. We abandoned that 
approach for several reasons: 


• While it works extremely well in a perfect Kautz graph, it becomes very messy if there are any dead links or 
nodes in the graph, or if there are non-Kautz topologies in the system. 

• The packet contents must be the same everywhere, so there is no way to individually identify the target 
process(es). As a result, each node must decide whether there is any appropriate target process for each 
broadcast message. 

• The requirement to replicate a packet to all output ports significantly complicated the switch design, which 
associates each packet buffer with an input/output crosspoint. 


Instead, the DMA engine has provision for accepting a command (Do_Cmd) which directs the transmission of 
several output packets to software-selected destinations, allowing the construction of multicast trees with software- 
selected fanout, targeting specific processes at each destination node, and creating no new requirements for the 
switch fabric. 


4.7.6 Heap 


There are a number of data structures shared between the DMA engine hardware and the library software, which 
may be running on an application processor or the fabric processor; those structures need to be accessible to both, 
but the hardware uses physical addresses, while the software uses virtual addresses. To resolve this difference, we 
use a region of memory (called the Heap) which is user-writable and contiguous in both physical and virtual address 
spaces, and we refer to objects in that space by means of offsets (handles) within the heap. Such objects include 
communicators, the temporary values and fanout commands used by barriers and collectives, and unexpected eager 
messages. 

Objects in the heap are referenced by handles, which are checked against the size of the heap, which is controlled 
by the kernel. A handle which exceeds the size of the heap results in a heap handle failure, which will be reported 
on the event queue of the local process. 
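
The handle check can be modelled in a few lines. This is only an illustrative sketch, not the microcode: the names
HeapHandleFailure and resolve_handle are hypothetical, and treating a handle equal to the heap size as a failure is an
assumption. Only the rule itself (a handle is an offset that is checked against the kernel-controlled heap size) comes
from the text.

```python
class HeapHandleFailure(Exception):
    """Hypothetical stand-in for the heap handle failure event."""


def resolve_handle(heap_phys_base: int, heap_size: int, handle: int) -> int:
    """Translate a heap handle (byte offset) into a physical address.

    heap_size is the kernel-controlled size of the heap in bytes.
    Assumption: a handle equal to heap_size is already out of range.
    """
    if handle < 0 or handle >= heap_size:
        raise HeapHandleFailure(f"handle {handle:#x} outside heap of {heap_size:#x} bytes")
    return heap_phys_base + handle
```

Because software and hardware both work from the same offset, neither side needs to know the other's view of the
base address.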


Reserved Heap for Notifiers The first 8K bytes of the heap (addresses below 0x2000) are reserved for use by 
the DMA Engine, and must be zero at initialization time; microcode uses bits in that area to record DMA receiver 
buffer descriptor faults until they have been reported on the event queue. 


4.7.7 Protected data structures


The hardware presents three interface pages for each of the 14 DMA Contexts it services (one writable by 
user mode, one user readable and writeable, and one writable only by the kernel). In addition, the kernel has 
direct read-write access to other information stored in the DMA Engine DMEM. Access control is managed by the 
processor’s virtual address translation hardware. 

Software may choose whether and how to use the kernel processes defined by the hardware. One option is to assign
one to each processor on a node, so that packets and interrupts can be delivered to a dedicated processor (a fabric
processor, for example) rather than to an interrupt routine on one processor, which might then have to notify the
scheduler on another. The hardware simply presents the pages as described in Table ??; software may map them as
needed. [Additional process ids might be useful if we want to deschedule a process while it communicates or while a
priority process runs.] [Offsets are more or less arbitrary, and may change.]

XXX NOTE: the DmaProcCtlStatus page is actually just process-specific storage cells in DMEM; refer to the DMEM
map. The write-only items in DmaAppIface are addressed by stores through the DMA External I/O addresses at
RA_DmaAppIface1 + (ProcessIndex * 0x10000) + offset (where the L2 cache hardware converts them to SPCL
operations on the CSW). The RO and RW items are addressed through RA_DmaAppIface0 + (ProcessIndex *
0x10000) + offset using load (RDIO) and store (WTIO) operations.
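
The address arithmetic for the interface pages can be sketched as follows. The 0x10000 per-process stride is taken
from the text; the function name iface_address and the base value used in the example are hypothetical placeholders,
since the real RA_DmaAppIface0 and RA_DmaAppIface1 values are defined elsewhere.

```python
PROCESS_STRIDE = 0x10000  # spacing between per-process pages, from the text


def iface_address(region_base: int, process_index: int, offset: int) -> int:
    """Return region_base + (process_index * 0x10000) + offset.

    region_base stands for RA_DmaAppIface0 or RA_DmaAppIface1; the caller
    supplies it because the actual values are not given in this section.
    """
    if process_index < 0:
        raise ValueError("process index must be non-negative")
    if not 0 <= offset < PROCESS_STRIDE:
        raise ValueError("offset must stay within one process's page range")
    return region_base + process_index * PROCESS_STRIDE + offset


# Example with a made-up base address, for illustration only.
EXAMPLE_BASE = 0x1000_0000
example = iface_address(EXAMPLE_BASE, process_index=3, offset=0x20)
```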

Class 

DmaProcCtlStatus 

Attributes 

-kernel 


Bits      | Mnemonic       | Access | Definition
d0        | processID      | KRW    | 16-bit process ID, unique within node
d1[63:0]  | counters       | KRW    | Sixteen 4-bit counters for use by collectives
d2        | cmdQuota       | KRW    | Max queued commands for this process, minus 1
d3[63:0]  | deferredCnt    | KRW    | Negative number of remote commands deferred by quota
d4[63:0]  | eventRegion    | KRW    | Region containing Process Event Queue
d5[63:0]  | eventQRdPtr    | KRW    | Event Queue Read (head) pointer
d6[63:0]  | eventQWrPtr    | KRW    | Event Queue Write (tail) pointer
d7[63:0]  | heapRegion     | KRW    | Region containing Library Heap
d8[63:0]  | cmdQRegion     | KRW    | Region containing Process Command Queue
d9[63:0]  | cmdQRdPtr      | KRW    | Command Queue Read (head) pointer
d10[63:0] | cmdQWrPtr      | KRW    | Command Queue Write (tail) pointer
d11[63:0] | bdtRegion      | KRW    | Region containing Buffer Descriptor Table
d12[63:0] | rdtRegion      | KRW    | Region containing Route Descriptor Table
d13[11:0] | eventIntCause  | KRW    | Interrupt cause code when event is queued
d13[?:12] | eventIntTarget | KRW    | Bus stop number to which interrupt is delivered


[Doublewords d4 through d12 are of type DmaBufferDesc.]


Class 
DmaAppIface1
Attributes


Mnemonic          | Access | Definition
eventQRdSize      | WO     | Written by application to indicate size (in bytes) of item taken from event queue
cmdQWrSize        | WO     | Written by application to indicate size (in bytes) of new commands
routeHdlPrefetch  | WO     | Written by application with an RDT handle to preload or invalidate the route cache
                  |        | entry at that offset
bufferHdlPrefetch | WO     | Written by application with a BDT handle to preload or invalidate the buffer
                  |        | descriptor cache entry at that offset

Software must write routeHdlPrefetch or bufferHdlPrefetch following any change to the RDT or BDT, respectively,
to ensure that the change is recognized by the DMA Engine. The value written is the offset in the RDT
or BDT of the updated entry. If the offset exceeds the size of the selected table, no update occurs, and the DMA
Engine increments qmgrErrorCnt.


Class 
DmaAppIface0
Attributes


Bits     | Mnemonic       | Access | Definition
d0[63:0] | eventQRdOffset |        | Event queue read pointer offset within region
d1[63:0] | eventQWrOffset |        | Event queue write pointer offset within region
d2[63:0] | cmdQRdOffset   |        | Command queue read pointer offset within region
d4[63:0] | fastCmdHdr     | WO     | Header doubleword of Send_Event or Send_Cmd for fast launch; a copy of
         |                |        | header on command queue

Restriction Registers in DmaAppIface0 and DmaAppIface1 must not be accessed while the DMA Engine has
any threads disabled, or while the countdownHalt bit is set. Doing so can hang the processor. [The restriction on
disabled threads does not currently apply, because we do not use mutex locking in the ioAccess microcode thread.]


The value written to fastCmdHdr must be the same as the header doubleword of the next command on the 
command queue, otherwise operation of the command is unpredictable. 


Command Queue The command queue is written into memory by software. The first word of the command 
contains the payload length, by which the hardware can know how many bytes to read to complete the command. 
The DMA engine copies it either directly to the appropriate port, or to the appropriate queue for the required 
port, where it will be serviced in order. The length of every queue entry is always 128 bytes, which need not all be 
written by the processor. 

Application software has two means by which it can notify the DMA engine of a new command on the command 
queue: 


• By writing a multiple (N) times 128 to cmdQWrSize, software indicates that N new commands have been
added to the queue.


• By writing the header doubleword of a command to fastCmdHdr, software indicates that one new command
has been added to the queue. This function works only for SEND_EVENT, SEND_CMD, and PUT_IM_HP
commands, and requires that the txPort field in the header is set correctly. In typical circumstances, this
mechanism allows lower-latency processing of the command.


Command Quota The kernel assigns a quota to each process for the number of commands that it may have
in the port queues at any time. Both local and remote commands are charged against that quota. When the
quota is reached, the DMA engine stops accepting commands from the process command queue, and any received
commands are copied to the event queue rather than a port queue. Any time a received command is sent to the
event queue, the deferred count is incremented, and all further received commands are sent to the event queue until
the deferred count returns to zero. [This is to keep received command processing in order.] Software must set bit
16 in the header of any deferred command in the command queue, so that the DMA engine knows to adjust the
deferred count.

The value in the cmdQuota register should be initialized to one less than the maximum number of outstanding 
commands allowed to the process; zero indicates that the process is allowed only one command at a time. 

Do_Cmd can execute a string of commands. Once that string is started (implying that cmdQuota is positive),
it is enqueued in its entirety, even if doing so drives cmdQuota below zero. Therefore, the port queues must be
sized to accommodate a number of commands at least equal to:

[(cmdQuota + (execLimit/128)) * number_of_processes] 
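
As a worked instance of this sizing rule, the sketch below evaluates the expression for one set of example values.
The function name and the example numbers are illustrative only, and execLimit is assumed to be a multiple of the
128-byte command size.

```python
COMMAND_BYTES = 128  # every command occupies one 128-byte entry


def min_port_queue_commands(cmd_quota: int, exec_limit_bytes: int, number_of_processes: int) -> int:
    """Evaluate (cmdQuota + (execLimit / 128)) * number_of_processes.

    Assumes exec_limit_bytes is a whole number of 128-byte commands.
    """
    if exec_limit_bytes % COMMAND_BYTES != 0:
        raise ValueError("execLimit is assumed to be a multiple of 128 bytes")
    return (cmd_quota + exec_limit_bytes // COMMAND_BYTES) * number_of_processes


# Example values (not from the specification): quota 7, execLimit 1024 bytes, 14 processes.
commands = min_port_queue_commands(7, 1024, 14)   # (7 + 8) * 14 = 210 commands
queue_bytes = commands * COMMAND_BYTES            # 210 * 128 = 26880 bytes
```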

Figure 4.3 outlines the treatment of command quotas and the deferred count. 


Interrupt Cause Register 13 (EventIntCause and EventIntTarget) will not cause an interrupt if zero. Bits 11:8 
select an interrupt cause register at the processor selected by EventIntTarget, and bits 7:0 overwrite any interrupt 
cause value previously in that register. Because there are only 8 interrupt cause registers per processor, bit 11 must 
be zero. 
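
The field layout described here can be illustrated with a small decoder. It is a sketch under stated assumptions:
the function name is hypothetical, and it treats the 12-bit eventIntCause field on its own, leaving eventIntTarget
aside because its bit position is not given in this paragraph.

```python
from typing import Optional, Tuple


def decode_event_int_cause(cause: int) -> Optional[Tuple[int, int]]:
    """Split a 12-bit eventIntCause value into (register_select, cause_value).

    Returns None when the field is zero, meaning no interrupt is raised.
    """
    if cause & ~0xFFF:
        raise ValueError("eventIntCause is a 12-bit field")
    if cause == 0:
        return None
    register_select = (cause >> 8) & 0xF   # bits 11:8
    if register_select & 0x8:
        # Only 8 interrupt cause registers exist per processor.
        raise ValueError("bit 11 must be zero")
    cause_value = cause & 0xFF             # bits 7:0, overwrites the register
    return register_select, cause_value
```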


4.7.8 DMA Engine Common Control/Status 


The Common Control/Status variables, the contents of which are listed in Table ??, are used by the DMA 
engine to manage the transmit queues for each output link. These values are typically initialized at boot time and 
otherwise ignored by software. 

Figure 4.3: Command Quota and Deferred Count

Each queue is described by three doublewords, the first of which specifies the physical address and length in
bytes of the memory used by the queue; the second contains a write (tail) address, and the third a read (head)
address. These three doublewords are used directly by the DMA engine, but are not accessible to the application;
the application sees only offsets from the beginning of the queue region, so it is unaware of relocation of the queue
by the operating system when the process is paged out and back in.


Note: everywhere a DmaQDesc occurs, the address is a byte address and the length is the negative byte length.


Software should refer to these variables through the names, as defined in the dma.load file; the assignments to 
specific dmem offsets are subject to change. 


Class 
DmaQDesc 
Attributes 


-kernel 


Mnemonic | Bits      | Definition
physAddr | d0[35:0]  | Queue region physical address
len      | d0[63:36] | Queue region negative length in bytes
wrAddr   | d1[35:0]  | Queue write (tail) address
wrLen    | d1[63:36] | Queue write negative length hint
rdAddr   | d2[35:0]  | Queue read (head) address
rdLen    | d2[63:36] | Queue read negative length hint


Name         | Offset        | Type | Definition
QmgrErrorCnt | 0xe78         |      | Count of commands ignored
             |               |      | Max allowed length of Do_Cmd string (in bytes)
             |               |      | Microcode version number
PortQRegion  | 44*16 + tbg*8 |      | Queue region physical address and length
PortQRdPtr   | 45*16 + tbg*8 |      | Current read (head) pointer for each port queue
PortQWrPtr   | 46*16 + tbg*8 |      | Current write (tail) pointer for each port queue

Figure 4.4: Data Formats


4.8 Commands


The DMA engine receives commands as blocks of data, either written by a processor on the same node, or as 
the payload of a received Enq_Response packet. In either case, bits 63:32 of the first doubleword of the block are 
interpreted as a route handle, which is evaluated in the RDT of the current process to determine what queue and 
path to use for the packet. The RDT entry determines the port that will be used by the command, and hence the 
appropriate queue for holding the command until the port is available. 

Enum 

DmaCmdType 

Attributes 

-kernel 


Command encoding is chosen to make command codes match the packet types they send, insofar as possible, 
with the valid packet types all coded with even parity (probably not necessary, but we still have plenty of code 
space...). 

Software must inform the DMA Engine of any new command by writing that command to the next available 
128-byte block of the command queue, and either: 


• (Standard method) Write the number of new commands times 128 to the I/O register called cmdQWrSize, or


• (Fast path, for one command only) Write the header of the new command to the I/O register called fastCmdHdr.


4.8.1 Command Header 


The first doubleword of every command has a uniform structure, shown here: 
Class 

DmaCmdHead 

Attributes 


-kernel 
Bits  | Mnemonic    | Definition
?     | ?           | Reserved for process index
19:16 | countId     | Do_Cmd counter selector
23:20 | countTotal  | Do_Cmd counter reset value
25:24 | txPort      | Output port (hint) to be used for command
30:28 | reserved    | Reserved; must be zero
?     | deferred    | Bit to indicate deferred remote command
63:32 | routeHandle | Route handle for path to destination
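
For illustration, the header fields whose positions are stated in the text can be unpacked as below. This is a
sketch only: the function name is hypothetical, the length field is placed in bits 7:0 because the text refers to
"the length field in byte 0 of the header", and fields whose bit positions are not stated (such as the command type
and the deferred bit) are deliberately left out.

```python
def decode_cmd_header(header: int) -> dict:
    """Extract the documented fields from a 64-bit command header doubleword."""
    if not 0 <= header < 1 << 64:
        raise ValueError("header must fit in 64 bits")
    return {
        "length": header & 0xFF,                      # byte 0
        "countId": (header >> 16) & 0xF,              # bits 19:16
        "countTotal": (header >> 20) & 0xF,           # bits 23:20
        "txPort": (header >> 24) & 0x3,               # bits 25:24
        "routeHandle": (header >> 32) & 0xFFFFFFFF,   # bits 63:32
    }


# Build an example header from its fields and decode it again.
example_header = (0x1234 << 32) | (2 << 24) | (5 << 20) | (3 << 16) | 24
fields = decode_cmd_header(example_header)
```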


4.8.2 Send_Event Command 


The Send_Event command instructs the DMA engine to create and send an Enq_Direct packet, whose payload 
will be stored on the event queue at the destination process. If it isn’t processed immediately, a Send_Event 
command waits on the Tx_fg (foreground) queue. 

Class 

DmaCmdEvent 

Attributes 

Bits     | Mnemonic | Definition
d0[63:0] | header   | Command type is Send_Event
d1[63:0] | control  | Reserved; must be zero
d2[63:0] |          | First payload data


The length field in byte 0 of the header encodes the payload length, which must be a multiple of 8 and between 
8 and 112 bytes. 
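
The 112-byte limit follows from the 128-byte queue entry size less the header and control doublewords. A minimal
check of the length rule might look like this; the function and constant names are illustrative.

```python
ENTRY_BYTES = 128    # size of every command queue entry
HEADER_BYTES = 8     # d0, the command header
CONTROL_BYTES = 8    # d1, the control doubleword
MAX_PAYLOAD_BYTES = ENTRY_BYTES - HEADER_BYTES - CONTROL_BYTES  # 112


def valid_send_event_length(payload_bytes: int) -> bool:
    """True if the payload length is a multiple of 8 between 8 and 112 bytes."""
    return payload_bytes % 8 == 0 and 8 <= payload_bytes <= MAX_PAYLOAD_BYTES
```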

Software may optionally use the “fastCmd” mechanism to perform a Send_Event command, saving significant 
overhead if the required output port is idle. 


4.8.3 Send_Cmd Command 


The Send_Cmd command instructs the DMA engine to create an Enq_Response packet, with a payload to be
processed as a command at the destination node. If it isn’t processed immediately at the source, the Send_Cmd 
command waits on the source Tx_fg queue. The Send_Cmd command contains a nested command as its payload; 
that nested command will be interpreted at the remote node as if it had been issued by the receiving process, 
but the nested command must not be Send_Cmd or Supervise. The nested command in the Enq_Response packet 
determines which command queue (Tx_fg or Tx_bg) at the remote node receives the command; the route handle 
in the packet payload determines the RDT entry selected, and thus the output port selected at the destination. 

Class 

DmaCmdSendCmd 

Attributes 

Bits      | Mnemonic     | Definition
d0[63:0]  | header       | Command type is Send_Cmd
d1[63:0]  | control      | Reserved; must be zero
d2[63:0]  | payloadHead  | Payload; a nested command to be enqueued at the receiver
d3[63:0]  | payloadCtl   | Payload; control word of nested command
d4[63:0]  | payloadPay1  | Payload of nested command
d5[63:0]  | payloadPay2  | Payload of nested command
d6[63:0]  | payloadPay3  | Payload of nested command
d7[63:0]  | payloadPay4  | Payload of nested command
d8[63:0]  | payloadPay5  | Payload of nested command
d9[63:0]  | payloadPay6  | Payload of nested command
d10[63:0] | payloadPay7  | Payload of nested command
d11[63:0] | payloadPay8  | Payload of nested command
d12[63:0] | payloadPay9  | Payload of nested command
d13[63:0] | payloadPay10 | Payload of nested command
d14[63:0] | payloadPay11 | Payload of nested command
d15[63:0] | payloadPay12 | Nested command payload continues up to 12 doublewords


The length field in the header is variable; it gives the length of the nested command, including its header and 
control word. The header of the nested command also has a length field which can be at most 96 bytes. 

Queueing of the nested command at the destination is controlled by the cmdQuota and deferredCnt process 
variables at the destination. If the deferredCnt is non-zero, or the remaining cmdQuota is negative, the nested 
command is pushed onto the event queue with an event type that indicates it is a deferred command, and the 
deferredCnt variable is incremented. Otherwise, the command is processed as if it had been pushed onto the 
destination node’s command queue by software on that node, and the quota is decremented. 
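
The decision made at the destination can be modelled as a small state machine. This is a sketch of the rule as
stated in the paragraph above, not of the microcode: the class and function names are hypothetical, and deferredCnt
is modelled as a non-negative magnitude even though the register description calls it a negative number.

```python
from dataclasses import dataclass


@dataclass
class ProcessState:
    cmd_quota: int          # remaining quota; initialised to (max outstanding - 1)
    deferred_cnt: int = 0   # modelled as a magnitude (assumption)


def receive_nested_command(state: ProcessState) -> str:
    """Return where a received nested command is placed, updating the state."""
    if state.deferred_cnt != 0 or state.cmd_quota < 0:
        # Deferred: pushed onto the event queue for software to reissue later.
        state.deferred_cnt += 1
        return "event_queue"
    # Accepted: treated as if local software had queued it.
    state.cmd_quota -= 1
    return "port_queue"


# A process allowed two outstanding commands starts with cmd_quota = 1.
state = ProcessState(cmd_quota=1)
placements = [receive_nested_command(state) for _ in range(4)]
```

Once one command has been deferred, every later one is deferred as well until software drains the deferred count,
which preserves the order of received commands.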

Software may optionally use the “fastCmd” mechanism to perform a Send_Cmd command, saving significant
overhead if the required output port is idle.


4.8.4 Do_Cmd Command 


The Do_Cmd command instructs the DMA engine to perform a string of commands which will be found in the 
local heap. The string of commands must not include Do_Cmd commands. Each command in a string contains its 
own route handle and command code, which together determine the queue on which that command waits. 

Class 

DmaCmdExecute 

Attributes 

Bits      | Mnemonic   | Definition
d0[11:0]  | header     | Command type is Do_Cmd
d0[19:16] | countId    | Counter selector
d0[23:20] | countTotal | Counter reset value
d1[31:0]  | execHandle | Heap handle for first command
d1[63:32] | execCount  | Number of bytes in command string


The countId field identifies one of 16 4-bit counters associated with the target process; Do_Cmd decrements
that counter. If the starting value of the counter is zero, the value is replaced by the contents of the countTotal
field and the commands specified by execHandle are enqueued. If the starting value is non-zero, the decremented
count is saved and the specified commands are ignored.

The counters for each process are in the counters register in DmaProcCtlStatus; the counter selected by
countId=0 is in bits 3:0 of that register; the counter selected by countId=15 is in bits 63:60. Counter 0 may be
implicitly accessed by the successful completion of a Put_Bf_Bf command; it is ordinarily left containing 0.
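
The counter rule can be expressed as a pure function on the 64-bit counters register. The sketch below is a model
for illustration (the function name is hypothetical); it follows the description above: a counter that starts at
zero is reloaded with countTotal and the command string fires, otherwise the counter is decremented and the string
is ignored.

```python
MASK64 = (1 << 64) - 1


def do_cmd_counter(counters: int, count_id: int, count_total: int):
    """Apply one Do_Cmd to the packed counters register.

    Returns (new_counters, fire), where fire says whether the command
    string named by execHandle would be enqueued.
    """
    if not 0 <= count_id <= 15:
        raise ValueError("countId selects one of 16 counters")
    shift = 4 * count_id                      # counter n lives in bits 4n+3:4n
    value = (counters >> shift) & 0xF
    if value == 0:
        new_value, fire = count_total & 0xF, True
    else:
        new_value, fire = value - 1, False
    new_counters = (counters & ~(0xF << shift)) | (new_value << shift)
    return new_counters & MASK64, fire


# Four Do_Cmds on counter 15 with countTotal = 2, starting from an all-zero register.
reg = 0
fired = []
for _ in range(4):
    reg, fire = do_cmd_counter(reg, 15, 2)
    fired.append(fire)
```

With countTotal = 2 the string fires on the first Do_Cmd and again on every third one after that, which is the
behaviour a fan-in collective would rely on.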

The execHandle is a byte offset in the heap at which the first command will be found; execCount is the number 
of bytes of the commands, and each command is 128 bytes long. ExecCount must not exceed the value in the 
ExecLimit register, controlled by the kernel. When Do_Cmd executes a command string, all the commands are 
enqueued, and cmdQuota is decremented for each, regardless of the sign of cmdQuota; the port queues must 
therefore be sized to allow ExecLimit space after cmdQuota is exhausted. 

Do_Cmd executes foreground commands for each transmit port in the order specified in the command string, 
but there is no order guarantee with respect to the command queue, background commands, or other ports. 

Unused fields in the header word (length and route handle) must be zero. 

Do_Cmd must not be issued on the “fast path”; doing so results in a cmdFault. 


4.8.5 Put_Bf_Bf Command


Put_Bf_Bf commands instruct the DMA engine to create and send a sequence of DMA packets to the remote
node; the packet payload is taken from a buffer identified by a buffer handle. Put_Bf_Bf commands wait on the
Tx_Bg (background) queue for the availability of a transmit context. The implication is that while most commands
following any given route are completed in the order in which they were enqueued by software, Put_Bf_Bf commands
may be delayed with respect to other commands to the same destination. Put_Bf_Bf will never be started before
completion of previously-queued commands which use the same route.

Class 

DmaCmdPutBf 

Attributes 
Mnemonic        | Definition
header          | Command type is Put_Bf_Bf
segLength       | Segment length in bytes
execRouteHandle | Route handle to notify of successful completion, or zero if local
txOffset        | Byte offset from local buffer descriptor base
txBufferHandle  | Local Buffer Descriptor index
txNotifier      | Segment identifier for transmitter
rxOffset        | Byte offset from remote buffer descriptor base
rxBufferHandle  | Remote Buffer Descriptor index
rxNotifier      | Segment identifier for receiver
swBucket        | Available space for data delivered to receiver event queue
execHandle      | Heap offset of receive-completion command string
execCount       | Length of receive-completion command string


TxOffset and RxOffset define the starting byte addresses of the source and destination buffers with respect to the
buffer descriptors selected by txBufferHandle and rxBufferHandle on the local and remote nodes, respectively. The
calculation works as follows:

The txBufferHandle is extracted and multiplied by 8; the result is added to the sending process BdtRegion 
pointer, where the source buffer descriptor is found. The txOffset (which must be a multiple of 4) is added to the 
address in the buffer descriptor to give the starting address of the source buffer. 

At the receiver, a similar process interprets the rxBufferHandle and rxOffset in the destination process environ- 
ment, except that the rxOffset must be a multiple of 32. 
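
The calculation can be sketched as follows. Because the layout of a buffer descriptor is not given in this section,
the sketch takes a caller-supplied function (read_descriptor_base, a hypothetical name) that returns the base
address held in the descriptor at a given location; all other names are likewise illustrative.

```python
from typing import Callable

BDT_ENTRY_BYTES = 8     # the buffer handle is multiplied by 8
TX_OFFSET_ALIGN = 4     # txOffset must be a multiple of 4
RX_OFFSET_ALIGN = 32    # rxOffset must be a multiple of 32


def buffer_start(bdt_region_base: int, buffer_handle: int, offset: int,
                 alignment: int, read_descriptor_base: Callable[[int], int]) -> int:
    """Return the starting address of a transfer within a described buffer."""
    if offset % alignment != 0:
        raise ValueError(f"offset must be a multiple of {alignment}")
    descriptor_addr = bdt_region_base + buffer_handle * BDT_ENTRY_BYTES
    return read_descriptor_base(descriptor_addr) + offset


# Toy BDT: handle 3 describes a buffer whose base address is 0x40000.
toy_bdt = {0x8000 + 3 * BDT_ENTRY_BYTES: 0x40000}
tx_start = buffer_start(0x8000, 3, 0x10, TX_OFFSET_ALIGN, toy_bdt.__getitem__)
```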

In the event of a buffer descriptor fault (a buffer handle is too large, implying a buffer descriptor outside the
BDT region, or a buffer descriptor with length of zero) the transfer is terminated, and a BdFault (rx) or SegAbort
(tx) event is stored on the event queue at the receiver. There is no direct notification of the transmitter, even if
the fault occurs there.

On the transmit side, the offset plus length of a segment may be allowed to exceed the maximum address implied 
by the Tx Buffer Descriptor. In that event, the transmitted data runs to the end of the specified region, then any 
excess comes from the region specified by the next buffer descriptor. No such continuation is permitted at the 
destination. 

Upon successful completion of a segment, the receiver tests the execRouteHandle. If zero, the receiver executes 
a string of commands, as described by the execCount and execHandle (same as Do_Cmd). If non-zero, the receiver 
builds a Do_Cmd containing the execCount and execHandle, and sends it as if it were in a Send_Cmd with that 
execRouteHandle as the route. 


Note to software implementors: Library software must be prepared to deal with source and destination buffers 
which may have different alignment. The hardware is designed to handle the most common cases, but there are 
several conditions which require special handling by software: 


• If the destination buffer does not start and end at cache block (32-byte) boundaries, software must use another
mechanism (probably built upon Send_Event) to deliver the data which belongs in partial blocks.


• If the destination buffer does not start at a 64-byte boundary, the transfer will make most efficient use of
the memory bandwidth if the library uses a short segment to achieve 64-byte alignment for the bulk of the
transfer.


• If the source and destination buffers do not have the same alignment, the starting offset in the source buffer
should be specified to align the first packet to a 64-byte boundary at the destination.


• If the source and destination buffer alignments differ by an amount which is not a multiple of 4, the alignment
must be adjusted by a software copy before or after the transfer.


[Figure: DMA Completion flowchart. Tx Fault: store SegAbort event at receiver, with swBucket. Rx Fault: store BD
Fault event at receiver, with swBucket. Success: either store a local End_Seg event with swBucket, then locally
Do_Cmd with execHandle and execCount; or Send_Event with swBucket to execRouteHandle, then Send_Cmd to
execRouteHandle carrying Do_Cmd with execHandle and execCount.]


4.8.6 Put_Im_Hp Command


Put_Im_Hp commands instruct the DMA engine to send a single Wr_Heap packet to the remote node; the packet
payload comes directly from the command and is written to the remote heap. Put_Im_Hp commands wait on the
Tx_fg (foreground) queue for the availability of an output port.

Class 

DmaCmdPutImHp 
Attributes

Bits     | Mnemonic   | Definition
d0[63:0] | header     | Command type is Put_Im_Hp
d1[31:0] | heapHandle | Heap offset at destination (aligned 64)
d2[63:0] | payload    | Initial payload doubleword (first of up to 14)


The length field in byte 0 of the header gives the length of the payload in bytes. It must be a multiple of 8, and 
the payload will be extended with zeros to the next 32-byte boundary when it is stored at the destination. 

Software may optionally use the “fastCmd” mechanism to perform a Put_Im_Hp command, saving significant 
overhead if the required output port is idle. 


4.8.7 Supervise Command 


The Supervise command provides control mechanisms for management of the DMA engine. It serves as a marker
which writes its payload to the local event queue when all earlier foreground commands for a selected port have
been sent. The marker is intended to provide library software with a reliable indication that space in the command
queue and/or heap is available.

Note that completion of DMA transfers is generally reported by an endSeg event; Supervise is useful for flushing
the commands in the transmit foreground queues.

Class 
DmaCmdSupervise 
Attributes 

Bits     | Mnemonic | Definition
d0[63:0] | header   | Command type is Supervise, length 16. Port to mark is specified by txPort 25:24
d1[63:0] | control  | Reserved
d2[63:0] | payload  | First payload data doubleword, copied to Event queue
d3[63:0] | payload  | Second payload data doubleword, copied to Event queue


The Supervise command selects an output port using bits 25:24 of the header, and stores its payload on the 
local event queue after processing all earlier commands for the same output port. 

Supervise evaluates d0[63:32] as a route handle, like other commands, even though it will be used only to verify 
that the port selected by the header matches that selected by the route. 

Supervise may not be nested inside Send_Cmd. 

Software may optionally use the “fastCmd” mechanism to perform a Supervise command, saving significant 
overhead if the required output port is idle. 


4.8.8 Undefined Commands 


Command codes which have not been defined otherwise result in a cmdFault event being stored on the event 
queue of the context in which they occur. 


4.9 Packet formats 


4.9.1 Packet header and check 


Packet sizes are multiples of 8 bytes, so that packet boundaries correspond to symbol framing boundaries on 
the link. Each 8-byte unit is referred to as a “ford”; see Matt for derivation and justification. Data packets consist 
of four or more fords, up to 19. The first ford of every data packet (the header) contains a routing string, a virtual 
channel number, a buffer index for the next switch, and a link sequence number for error recovery; the second 
ford, called the control word, is interpreted by the receiving DMA engine to control where and how the payload is 
stored; the last ford, the trailer, contains the packet type, a 20-bit identification code for the target process at the 
destination node, a CRC checksum, and 8 constant bits (which are the translation of the “comma” symbol used to 
mark the end of the packet). See Table 4.2. The control word may or may not be present; the hasCtl flag in the 
header is set if and only if the word is present. 

Idle packets consist of a single ford marked by a comma symbol which is used only by Idle. The remaining bits 
may be used for diagnostic or out-of-band information and a CRC checksum. 


May 14, 2014 199 Rev 51328 


SiCortex Confidential CHAPTER 4. DMA ENGINE MICROCODE 


Table 4.2: Packet Header and Trailer 


Switch 
RDT 


26:22 | Switch 
DMA 


Link Seq No 31:28 | Switch 
6x32 | RDT 


Non-idle packets need a type field, to control their interpretation, and a process id, which must match that 
assigned to the receiver by the kernel. This is to prevent confusion when processes are rescheduled or moved 
between processors, and to prevent rogue processes from examining or modifying unrelated process memory. 

[We still need to define any required debug and performance monitoring features.] 


4.9.2 Packet Types 


Table 4.9.2 lists the defined packet type codes. Any packet received with an undefined type is reported as an 
error and discarded. [Currently, all valid packet types have even parity, to make it that much more difficult to 
mistake a corrupted packet. Next we should use the odd-parity codes which are distance 3 from poison. Seems 
excessive, at this point, but we have plenty of codes still.] 

Enum 

DmaPktType 


Code    | Name         | Definition
4'b0011 | ENQ_DIRECT   | Push packet payload onto software event queue
4'b1001 | ENQ_RESPONSE |
4'b1010 | WR_HEAP      |
4'b1100 | DMA_END      |
4'b1111 | POISON       | Discard packet
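
The parity remarks above can be checked mechanically. The sketch below uses only the type codes that appear in the
table; the dictionary and helper names are illustrative.

```python
PACKET_TYPES = {
    "ENQ_DIRECT": 0b0011,
    "ENQ_RESPONSE": 0b1001,
    "WR_HEAP": 0b1010,
    "DMA_END": 0b1100,
    "POISON": 0b1111,
}


def has_even_parity(code: int) -> bool:
    """True if the code contains an even number of 1 bits."""
    return bin(code).count("1") % 2 == 0


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


all_even = all(has_even_parity(code) for code in PACKET_TYPES.values())

# The 4-bit codes at distance 3 from POISON are the single-bit codes, all of odd parity.
distance3_from_poison = [c for c in range(16)
                         if hamming_distance(c, PACKET_TYPES["POISON"]) == 3]
```

Since every listed code has even parity, any two of them differ in at least two bits, so a single-bit error cannot
turn one listed type into another.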


4.9.3 Direct Transmission: Enq_Direct


Short messages, consisting of one or a few packets, are sent by the sending process constructing a command 
with a route handle and the contents of the desired packet, whose payload is deposited on the event queue of the 
receiving process. The event queue is processed by software at the receiver. 

Table 4.3 shows the form of a packet whose contents will be deposited on the event queue for processing by 
software; similar packets are available to store to the DMA Engine’s command queue. Another form stores to the 
heap, using the control doubleword to specify a heap offset. 

Enq_Direct packets are generated by Send_Event commands.

Event queue entries are all 128 bytes. The DMA engine writes the packet payload, and fills to the next 64-byte 
boundary with zeros. 


Table 4.3: Direct Queue Packet Fields

        | Size (bytes) |            |
Header  | 8            | DMA Engine | As defined in Table 4.2
Control | 0            |            | Skipped; hasCtl=0
Payload | 8-112        |            | For use by software
Trailer | 8            | DMA Engine | As defined in Table 4.2


4.9.4 DMA


DMA packets are the heavy truckers of the SiCortex fabric. They carry the high-volume message traffic between
cooperating nodes which have set up matching transmit and receive contexts. In addition to the payload and the
header/checksum overhead carried by all packets, DMA packets carry a control ford which tells the receiver’s DMA
Engine where to store the payload in the destination buffer. The format of the control ford is shown below:

Class 

DmaCmdCtl 


Attributes 
Bit       | Mnemonic     | Definition
d0[31:0]  |              | Byte offset of packet payload with respect to buffer descriptor
d0[47:32] | bufferHandle | Index into BDT for buffer descriptor (multiply by 8)
d0[63:48] |              | Bit index into heap for error flag
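
A decoder for this control ford might look like the sketch below. Only bufferHandle is a mnemonic from the table;
the keys "offset" and "errorFlagBit" are hypothetical names chosen for the two fields whose mnemonics are not shown,
and the function names are illustrative.

```python
def decode_dma_control(ctl: int) -> dict:
    """Split the 64-bit DMA control ford into its three fields."""
    if not 0 <= ctl < 1 << 64:
        raise ValueError("control ford must fit in 64 bits")
    return {
        "offset": ctl & 0xFFFFFFFF,            # bits 31:0, byte offset in buffer
        "bufferHandle": (ctl >> 32) & 0xFFFF,  # bits 47:32, index into BDT
        "errorFlagBit": (ctl >> 48) & 0xFFFF,  # bits 63:48, bit index into heap
    }


def bdt_byte_offset(buffer_handle: int) -> int:
    """Byte offset of a descriptor within the BDT (handle multiplied by 8)."""
    return buffer_handle * 8


example_ctl = (0x0123 << 48) | (0x0007 << 32) | 0x00000400
ctl_fields = decode_dma_control(example_ctl)
```

A 16-bit bit index spans 65,536 bits, that is 8K bytes, which is consistent with the size of the heap area reserved
for notifiers.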


See Table 4.4. 


Table 4.4: DMA packet fields 


        | Size (bytes) |            |
Header  | 8            | DMA Engine | As defined in Table 4.2
Control | 8            | DMA Engine | As defined above, in class DmaCmdCtl
Payload | ≤ 128        |            |
Trailer | 8            | DMA Engine | As defined in Table 4.2


Message buffers are not necessarily aligned with respect to cache blocks, at either the transmitting or the 
receiving node, but the DMA engine requires that a received DMA packet must be aligned so that its payload 
starting address precisely corresponds to an integral number of L2 cache blocks (64-byte boundary). Therefore, the 
transmitting node’s DMA engine may be required to form packets from up to three cache blocks, with alignment 
at any 4-byte boundary; library software is obliged to use Enq_Direct packets to pass data at the beginning and 
end of a message which do not align to a cache block boundary. 

The DMA payload length is permitted to be less than a multiple of 32 bytes; in that event, the receiver will
extend the payload with zeros to the next larger 32-byte boundary.

When a DMA packet is received, the receiver uses the buffer handle (*8, for a byte address) to obtain a buffer 
descriptor from the process BDT. The buffer offset is added to the descriptor base address to obtain the address 
at which the payload is stored. In the event of a fault, the payload is not stored, and the microcode sets a flag 
in the heap (bit number rxNotifier mod 8 in byte rxNotifier / 64 of the heap). The flag is tested and cleared by 
a DMA_End packet when the transmitter finishes the segment; if set, the receiver stores a bdtFault rather than 
rxEndSeg. 

DMA packets are generated by Put_Bf_Bf commands.


4.9.5 DMA_End


A DMA_End packet is sent following the final DMA packet of a segment to mark successful transmission. It 
contains sufficient information to allow the receiver to store an EndSeg or bdtFault event on the event queue, and if 
the transfer was successful, optionally activate a string of dependent commands at the receiver (if execRouteHandle 
is zero) or a remote node as specified by the execRouteHandle relative to the receiver’s RDT. 


4.9.6 Wr_Heap 


Wr_Heap packets are used to write the Heap communication area allocated by the target process. 

Under some circumstances, the sender of a short message may choose to use Wr_Heap packets to transfer the 
message data to the destination node before matching SEND with RECV, so that software at the destination can 
copy the data once a match has been made. 
Table 4.5: DMA_End packet fields 

Field | Size (bytes) | | Description 
Header | 8 | DMA Engine | As defined in Table 4.2 
Control | 8 | DMA Engine | Notifier and BD Handle as in DmaCmdCtl; execRouteHandle in 31:0 
Payload0 | 8 | Command | Software “bucket” 
Payload1 | 8 | Command | Exec Handle and Count 
Trailer | 8 | DMA Engine | As defined in Table 4.2 


Table 4.6: Wr_Heap packet fields 

Field | Size (bytes) | | Description 
Header | 8 | DMA Engine | As defined in Table 4.2 
Offset | 8 | | Start offset within Heap 
Payload | 8-112 | | Data to be written to destination heap 
Trailer | 8 | DMA Engine | As defined in Table 4.2 


Offset must be a multiple of 64; length must be a multiple of 8. Writes to the heap always modify one to four 
aligned 32-byte blocks of memory. Memory beyond the last doubleword of payload is zeroed to the next 32-byte 
boundary. 
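
A minimal software model of the Wr_Heap write rules above, assuming the heap is represented as a Python bytearray. The function name wr_heap_apply is hypothetical; the checks simply restate the constraints in the text and in Table 4.6.

```python
BLOCK = 32  # heap writes modify aligned 32-byte blocks


def wr_heap_apply(heap: bytearray, offset: int, payload: bytes) -> int:
    """Apply a Wr_Heap payload to a model heap.

    Returns the number of 32-byte blocks modified (one to four).
    """
    if offset % 64 != 0:
        raise ValueError("offset must be a multiple of 64")
    if len(payload) % 8 != 0 or not 8 <= len(payload) <= 112:
        raise ValueError("payload must be 8 to 112 bytes, a multiple of 8")
    # Round the payload length up to the next 32-byte boundary.
    padded_len = -(-len(payload) // BLOCK) * BLOCK
    if offset + padded_len > len(heap):
        raise ValueError("write falls outside the heap")
    # Memory beyond the last doubleword of payload is zeroed.
    heap[offset:offset + padded_len] = payload + bytes(padded_len - len(payload))
    return padded_len // BLOCK
```

A 40-byte payload, for instance, modifies two blocks: 40 bytes of data followed by 24 bytes of zeros.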

Wr_Heap packets are generated by Put_Im_Hp commands. 


4.9.7 Enq_Response 


A Get request for a large message becomes an Enq_Response packet, created by the Initiator as part of a Receive 
command. When the initiator is ready to receive a segment, an Enq_Response is sent from the initiator to the 
responder (Table 4.7), containing a Put_Bf_Bf command to be used at the remote (responder) node. The command 
is processed by the DMA engine at the responder, subject to the same access constraints as if the entry had been 
placed on the command queue by local software. 


Table 4.7: Enq_Response Packet fields 

Field | Size (bytes) | | Description 
Header | 8 | DMA Engine | As defined in Table 4.2 
Control | 0 | | Skipped; hasC=0 
Payload | 16-112 | | Response command executed at destination 
Trailer | 8 | DMA Engine | As defined in Table 4.2 


Typically, an Enq_Response packet contains a Put_Bf_Bf command which directs transmission of a segment, 
but there are valid uses of other command types. 

When an Enq_Response packet is received by a responder, the responder checks the cmdQuota and deferredCnt 
variables for the target process. If the cmdQuota is exhausted (negative) or the deferredCnt indicates there are 
previously-deferred commands awaiting service, the response command in the packet is pushed onto the target 
process event queue with code deferredCmd, and the deferredCnt process variable is adjusted. This is to prevent 
remote commands from overflowing the port queues. Library software associated with the process must recognize 
the deferred command and copy it to the command queue, setting bit 31 (deferred) in the header. 
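
The deferral rule can be sketched in software as below. This is a hedged model, not the microcode: the class and function names are hypothetical, the direction in which deferredCnt is "adjusted" is assumed to be an increment, and quota accounting for accepted commands is not modeled.

```python
DEFERRED_BIT = 1 << 31  # header bit 31 marks a command as previously deferred


class ProcessState:
    """Hypothetical per-process state; field names follow the text."""

    def __init__(self, cmd_quota: int, deferred_cnt: int = 0):
        self.cmd_quota = cmd_quota
        self.deferred_cnt = deferred_cnt
        self.event_queue = []    # list of (event code, command words)
        self.command_queue = []  # list of command word lists


def receive_enq_response(proc: ProcessState, command_words) -> bool:
    """Responder side: accept the command, or defer it to the event queue."""
    if proc.cmd_quota < 0 or proc.deferred_cnt > 0:
        proc.event_queue.append(("deferredCmd", list(command_words)))
        proc.deferred_cnt += 1  # assumption: "adjusted" means incremented
        return False
    proc.command_queue.append(list(command_words))
    return True


def requeue_deferred(proc: ProcessState) -> None:
    """Library side: copy deferred commands to the command queue,
    setting the deferred bit in each header (word 0).

    How the engine updates deferredCnt when it sees the bit is not modeled.
    """
    remaining = []
    for code, words in proc.event_queue:
        if code == "deferredCmd":
            proc.command_queue.append([words[0] | DEFERRED_BIT] + words[1:])
        else:
            remaining.append((code, words))
    proc.event_queue = remaining
```

Note that once one command has been deferred, later commands are deferred as well even if quota is available again, which preserves their order.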


4.9.8 Poison 


Poison packets are not intentionally generated by the DMA Engine, and are discarded when received. Any 
packet may be converted to a poison packet if some link along its path detects a CRC error. That link will request 
retransmission, but the corrupted packet may already have left the station, so the poison type code causes it to be 
ignored. 
4.10 Notes on Complex Functions 


4.10.1 Rendezvous 


Rendezvous is the handshake sequence executed between a pair of processors planning to use DMA packets to 
pass a large message; it gives both participants the information needed to set up transmit and receive contexts. 

Rendezvous is initiated by software injecting a rendezvous request as a Send_Event in the command queue. The 
request contains communicator, source rank, and tag. It also carries buffer alignment information. The initiating 
node sends the request to the responding node, where the DMA engine stores the packet in the event queue so 
that software can find a matching receive. Once the match is found, the responding node issues either a Put_Bf_Bf 
command or a Send_Cmd containing a Put_Bf_Bf, which produces a stream of packets. Upon completion of the 
segment transfer, the receiving node stores an endSeg event and (if successful) processes its completion command 
string. 

There are substantial performance consequences from appropriate scheduling of segment transfers at Rendezvous; 
blindly queueing transfers in a FCFS order may result in severe hotspot congestion. It is up to software 
to reorder transfers for optimum performance. 


4.10.2 Stride and Scatter/Gather 


MPI specifies mechanisms by which the application can build messages that correspond to non-contiguous 
memory at the sender and/or receiver. The early plans for the DMA Engine included direct support for such 
messages, but they created a problem in that a packet which requires many main memory references may take 
much longer to service than its occupancy in any other stage of the communication pipeline; this creates the 
prospect of a message of such packets backing up the network in undesirable ways. Therefore, the fabric processor 
should be used for assembly and disassembly of non-contiguous messages, either by copying the data to and from 
contiguous buffers which are then transferred via rendezvous send/receive, or by transfer of convenient-sized chunks 
using directly-queued packets. 


4.10.3 Barrier and Collective 


A rough model: nodes in a communicator are organized in a tree (branching rate to be determined by experimentation) 
with a root, intermediates, and leaves. As each node reaches the collective operation: 

• Leaf nodes send their contribution (using Wr_Heap packets; see Table 4.6) to pre-allocated heap cells in their 
immediate parent, an intermediate node. 

• Intermediate and root nodes gather the contributions of their children and the local process. This can be in 
software, spin-waiting for completion, or using the counting facility in Do_Cmd to initiate transmission of the 
result toward the root. 

• When the contributions from their leaves have all arrived, intermediate nodes send a group contribution to 
their parent (again using Wr_Heap packets). 

• When the root receives all its contributions, it broadcasts the collective result to the entire communicator, 
using multicast. 


Reduction operations which require arithmetic (sum, max) must defer to software for the arithmetic, but may 
choose to gather several layers of inputs through such a tree before invoking software to perform the reduction. 
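
The tree model above can be illustrated with a small simulation. This is a sketch under stated assumptions: the rank numbering (rank 0 as root, children of rank r at r*fanout+1 onward) and the function names are hypothetical, and the message passing is replaced by plain list updates.

```python
import operator


def parent(rank: int, fanout: int):
    """Parent of a rank in a fanout-ary tree rooted at rank 0 (None for the root)."""
    return None if rank == 0 else (rank - 1) // fanout


def children(rank: int, size: int, fanout: int):
    """Ranks that send their contribution to this rank."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < size]


def tree_reduce(values, fanout: int, op):
    """Combine one value per rank up the tree and return the root's result.

    Ranks are visited from highest to lowest, so every child has folded in
    its own subtree before it contributes to its parent.
    """
    partial = list(values)
    for rank in range(len(partial) - 1, 0, -1):
        p = parent(rank, fanout)
        partial[p] = op(partial[p], partial[rank])
    return partial[0]


def tree_allreduce(values, fanout: int, op):
    """Reduce to the root, then model the broadcast of the result to all ranks."""
    result = tree_reduce(values, fanout, op)
    return [result] * len(values)
```

With seven ranks and a fanout of two, ranks 3 to 6 are leaves, ranks 1 and 2 are intermediates, and rank 0 is the root.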


4.10.4 Multicast 


The early design included a mechanism called Exploding Broadcast as part of the fabric switch; that approach 
has been abandoned for reasons outlined elsewhere. Current plans provide for a multicast mechanism in the DMA 
engine, implemented with ordinary point-to-point packets (carrying an Execute command) which can stimulate 
execution of multiple commands, sending output packets to software-selected destinations. 
4.10.5 Out-of-band 


The switch interface includes six pairs of registers corresponding to byte-wide send and receive paths to and from 
the immediate adjacent nodes on each of the switch input and output ports. Each register carries a byte of data 
plus a handshake bit. When a node writes its send register, the send register’s handshake bit is cleared, and sets 
again after software in the remote node reads the corresponding receive register. The remote node’s handshake bit 
is cleared when the byte arrives in the receive register, and sets when software reads the register. This mechanism 
is used by software in the early stages of configuring the fabric and booting the operating system, and remains 
available for any purpose required by software during normal operation. 
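
The handshake can be modeled for one direction of one link as follows. This is an illustrative sketch: the class name, the reset state of the handshake bits (both set), and the choice to refuse a write while the send bit is clear are assumptions, not statements about the hardware.

```python
class OobLink:
    """One byte-wide out-of-band path from a sender to its neighbor."""

    def __init__(self):
        self.send_handshake = True  # set: previous byte was read by the remote
        self.recv_handshake = True  # set: receive register has been read
        self.data = 0

    def send(self, byte: int) -> bool:
        """Sender writes its send register; returns False if still busy."""
        if not self.send_handshake:
            return False
        self.data = byte & 0xFF
        self.send_handshake = False  # cleared on write
        self.recv_handshake = False  # cleared when the byte arrives
        return True

    def receive(self):
        """Remote software reads the receive register; None if nothing new."""
        if self.recv_handshake:
            return None
        self.recv_handshake = True   # sets when software reads the register
        self.send_handshake = True   # sender may now write the next byte
        return self.data
```

In this model the sender polls send() until it succeeds, so a second byte cannot overwrite one the remote node has not yet read.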


4.10.6 Receive Matching 


We looked for a way to match MPI_SEND with MPI_RECV in microcode, so that the rendezvous could be 
turned around without software intervention. We were unable to devise a satisfactory solution, and for the moment 
at least, it’s not under consideration. 


4.10.7 Initialization 

This subsection will describe the process of initializing and starting the DMA engine in preparation for use, 
both at boot time and when a new process is allocated. 


4.10.7.1 Black Hole 

Upon power-up, the DMA engine, fabric switch, and links are in reset state, but there may be circumstances in 
which the initialization sequence is entered with some or all in operation. In particular, after a node crash induced 
by hardware or software failure, it is desirable to keep traffic flowing through the switch and links while the node 
reboots. To support such cases, the block reset register includes functions which ignore all packets entering or 
leaving the switch at its node. 
4.10.7.2 Reset 

During initialization, the following registers should be set up: 

• the block reset register should be set to inhibit traffic into and out of the local interface of the fabric switch 

• the thread select register should disable all 10 threads 

• the ECC mode register should be set to enable correction 

• the force error register should be cleared 

• ECC error interrupts should be disabled in the interrupt mask register 
After the instruction and data memories have been loaded and the common resources set up, these registers can 
be returned to their normal state. 
4.10.7.3 Microcode load 


The DMA Engine microcode assembler, dmaas, translates a symbolic representation of the microcode (called 
dma.lisp) into a numerical representation which specifies the microinstructions themselves and the initial states of 
dmem and thread-state variables. This is called the .load format: 


4.10.7.4 Variable binding 


Many of the DMA registers accessible through I/O reads and writes have values that are important to the 
device driver, but the particular address assignments may change from one version of microcode to another. The 
microcode .load format provides the necessary information to translate symbols to addresses, and initialization 
software is expected to refer to interface registers by using strings to name the register, translating the string to an 
I/O space address on the basis of the current .load file (perhaps using SymbolTableMap). 
Microcode version: By convention, a microcode variable named ucodeVersion is assigned to dmem 
location 511 (0x1FF). It contains in bits 31:0 the svn revision number at which the source code was committed; in 
bits 39:32 an identification code for the API it implements (3 for this specification); and bits 63:40 are defined according 
to the API code. 
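
As an illustration, the fields of ucodeVersion can be unpacked as below. The function name and dictionary keys are hypothetical; only the bit positions come from the text.

```python
def decode_ucode_version(word: int) -> dict:
    """Split the 64-bit ucodeVersion doubleword into its three fields."""
    return {
        "svn_revision": word & 0xFFFFFFFF,       # bits 31:0
        "api_code": (word >> 32) & 0xFF,         # bits 39:32 (3 for this spec)
        "api_defined": (word >> 40) & 0xFFFFFF,  # bits 63:40, meaning set by API
    }
```

A driver could compare api_code against the value it was built for before touching any other microcode variable.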


4.10.7.5 Initialization of common resources 


The DMA initialization software which runs during the boot process loads the microinstruction memory (using 
writes to R_DmaImem), dmem constants and global variables (using writes to R_DmaDmem), and the thread state 
variables (using writes to R_DmaThreadPtr[] and R_DmaThreadPc[]) as specified in the .load file. The following 
table lists the symbols needed for system initialization: 


Symbol | Description (initial value) 
portQRegion | Physical address and length of region reserved for transmit port queue 
portQRdPtr | Transmit port queue read pointer (copy of portQRegion) 
portQWrPtr | Transmit port queue write pointer (copy of portQRegion) 
rxErrorCnt | Count of bad packets received 
qmgrErrorCnt | Count of context 0 event queue overflows 


Certain dmem values refer to physical memory regions which are allocated by the kernel for use by the DMA 
Engine. In addition to areas used by each process, each port relies on reserved memory regions in which it can store 
a queue. For each queue, there are three doublewords in dmem, called the region descriptor, the write pointer, 
and the read pointer; the region descriptor and write pointer should be initialized to the same value: in bits 35:0, 
the physical memory address of the area of memory allocated for use by the queue (bits 5:0 must be zero); in bits 
63:36, the negative length of the allocated region. Thus, if the allocated region is 65,536 bytes (0x10000) starting 
at address 0x123456780, the doubleword value should be 0xFFF0000123456780. The read pointer should have the 
same address in bits 35:0, but zero in 63:36. 
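
The encoding can be checked with a short sketch that reproduces the worked example above. The function names are hypothetical, and the upper bound placed on the length is an assumption based on the 28-bit width of the field.

```python
ADDR_BITS = 36
ADDR_MASK = (1 << ADDR_BITS) - 1   # bits 35:0, physical address
LEN_MASK = (1 << 28) - 1           # bits 63:36, negative length


def region_descriptor(address: int, length: int) -> int:
    """Pack a region descriptor (also the initial write pointer value)."""
    if address & 0x3F or not 0 <= address <= ADDR_MASK:
        raise ValueError("address must fit in 36 bits with bits 5:0 zero")
    if not 0 < length <= LEN_MASK:
        raise ValueError("length does not fit the 28-bit field")
    negative_length = (-length) & LEN_MASK  # two's complement in 28 bits
    return (negative_length << ADDR_BITS) | address


def initial_read_pointer(address: int) -> int:
    """Read pointer: same address in bits 35:0, zero in bits 63:36."""
    return address & ADDR_MASK
```

For a 65,536-byte region at 0x123456780, region_descriptor returns 0xFFF0000123456780, matching the value given in the text.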

The eight areas allocated for port queues must be non-overlapping, aligned to 128-byte boundaries, and a 
multiple of 128 bytes in length. 
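
Kernel code that allocates these areas might validate them before loading dmem. The following is a hypothetical checker, not part of any SiCortex software; it returns a list of problems found rather than raising.

```python
def port_queue_region_errors(regions):
    """Check (address, length) pairs for the eight port queue areas.

    Returns a list of human-readable problems; an empty list means the
    layout satisfies the three rules stated above.
    """
    errors = []
    if len(regions) != 8:
        errors.append(f"expected 8 regions, got {len(regions)}")
    for address, length in regions:
        if address % 128 != 0:
            errors.append(f"region at {address:#x} is not 128-byte aligned")
        if length <= 0 or length % 128 != 0:
            errors.append(f"region at {address:#x} has bad length {length}")
    # Overlap check: after sorting by address, each region must end
    # at or before the start of the next one.
    ordered = sorted(regions)
    for (a0, l0), (a1, _) in zip(ordered, ordered[1:]):
        if a0 + l0 > a1:
            errors.append(f"regions at {a0:#x} and {a1:#x} overlap")
    return errors
```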


4.10.7.6 Initialization of process resources 


As the system associates operating system processes with process state in the DMA engine, it must allocate space 
in physical memory for the five communication regions used by each DMA process: the heap, the buffer descriptor 
table, the route descriptor table, the command queue, and the event queue. The command and event queues 
are each described by dmem registers containing read pointer, write pointer, and region descriptor, as described 
above for the port queues. The heap, BDT, and RDT are each described by a single dmem register containing a 
region descriptor in which bits 35:0 contain the physical address of the start of the region, and bits 63:36 contain 
the negative of the region length. The following table lists the symbols needed for initialization of each process, 
whenever a new process binding occurs. To avoid an error wrap case, queue Rd and Wr pointers must be initialized 
to offset 128 in the region (add 128 to both the address and negative length fields). 


Symbol | Description 
processID | Process identifier (16 bits) 
counters | Sixteen 4-bit counters used by Do_Cmd commands (init 0) 
eventQRegion | 
eventQRdPtr | 
eventQWrPtr | 
cmdQRegion | Physical address and length of region reserved for command queue from this process 
cmdQRdPtr | Command queue read pointer (copy of cmdQRegion) 
cmdQWrPtr | 
BDTRegion | 
RDTRegion | 
HeapRegion | Physical address and length of region for process heap 
cmdQuota | 
deferredCnt | 
eventIntCause | Interrupt cause word sent when an event is added to an empty event queue 
4.10.8 Process Rundown 


This subsection will describe the sequence of events required to deallocate a DMA Engine process. 


4.11 Lessons for Next Time 


4.11.1 Queue Manager 


The performance of this design suffers from a couple of problems. 

In the first place, the queue manager must read a command from memory, then translate its route, before 
knowing which tx thread will service it. And the requirement to keep commands in order makes it difficult to 
evaluate other commands during that process. A better design might require each software process to enqueue 
commands into separate queues for each port: 4 command queues per process. 

In that model, each transmit thread could scan its own queues, executing fg commands as they were encountered, 
and pushing bg commands to the port queue, to be handled when all the fg commands were finished. This would 
make processing more efficient, both because of parallelism, and because control information would not need to be 
moved from memory buffer to dmem to packet buffer. 

It would be necessary to come up with a way of pushing and processing commands received from remote nodes. 
They could have their own queue area in memory, treated like a separate process, or they could be pushed through 
the background port queue like DMA commands. 

The “fastpath” mechanism points in the right direction: use it for invocation of all locally-initiated commands 
(possibly except Do_Cmd). This allows dispatch to the appropriate port thread right away, with RDT access before 
command fetch. In most cases, the queue access is needed only for the payload. The Tx thread might have separate 
priority levels for fastpath (nothing on queue), enqueue, foreground, and background. Received commands in 
Enq_Response packets would be enqueued by receiver. 


4.11.2 Additional functionality 
4.11.2.1 Enqueue/Dequeue commands 


There should be a means by which a node can create a ring-buffer queue which is available to all processes in 
the same job to insert or remove entries; it may not be important to have more than one such queue per process, 
since they can be distributed almost anywhere. If we need only one, it is easier to name, and we can keep the 
pointers in hardware. Need ways to report full/empty status on a request. 


4.11.2.2 Global locks 


It may (or may not) prove useful to create locks which ensure globally that no more than one process has access 
to a data structure. Perhaps the general solution is a class of atomic read-modify-write functions. 


4.11.3 Microcode 


The microcoded engine is convenient for a couple of reasons: it has special-purpose functions, it is multithreaded, 
and it has a fat pipe to memory. It’s inconvenient that it is hard to program (no compiler), has no cache, 
and behaves differently than other bus clients, besides the fact that it needs a separate design. I don’t think the 
special functions, aside from the fat pipe, are worth much. We could have had a multithreaded MIPS core with 
prefetch and flush instructions, and done more with less. 


4.11.3.1 Buffer addressing 


We need indexed addressing into the packet and (especially) memory buffers, so that we don’t have to dispatch 
to separate microinstructions to access the appropriate doubleword. 


4.11.3.2 Buffer reset 


It would be good to be able to clear the memory buffer in a single operation, to prevent leaks of information 
between processes. 
4.11.4 Copy port 


It was a mistake to try to short-circuit the fabric for local transfers. The local ports should have been a copy 
of the remote ports so that the hardware and microcode were exactly the same. Additional ports into this pile of 
latches should not be a problem. 


4.11.5 Receive ports 


The payload length needs to be writable for cases like deferred commands, where we want to combine the 
payload with a new header before writing to memory. 


4.11.6 Cache 


The DMA should have a cached, coherent interface to memory. The lack of coherent synchronization is (I 
suspect) going to prove to be a stumbling block. 


4.12 Microcode 
Chapter 5 


DMA Engine 


by Jud Leonard and Bryce Denney 
[Last Modified $Id: DmaImpl.lyx 46805 2007-10-30 21:33:40Z denney $] 


5.0.1 Package Attributes 
Package: chip_dma_spec 
Attributes: -dwaccessors 


5.1 Introduction 


The DMA Engine provides a high-bandwidth interface between the memory system and the fabric switch, 
relieving software of the low-level work of repetitively creating packets of memory data and injecting them into the 
fabric, or accepting packets from the fabric and distributing their payload to appropriate locations in memory. 

This chapter describes the hardware of the DMA Engine. DMA Engine functions implemented by microcode, including 
the application-level software interface, are defined in another chapter. 


5.2 Implementation 


The ICE9 DMA engine is implemented as a programmable microengine that manages a set of TX and RX 
ports and an interface to the L2 cache. The microengine decides how to send outgoing packets and what to do 
with incoming packets, but relies on the other blocks to do nearly all data copying. Each of the TX and RX ports 
contains packet buffers, state machines, and address sequencers so that it can transfer to/from the fabric switch 
without consuming microengine cycles. The microengine reads its microcode from an instruction memory, which 
is initialized by system software at boot time. In each cycle it can perform an arithmetic operation on two 64-bit 
operands (A and B), producing a 64-bit result and a set of condition codes which can compute a branch target. 
Operands A and B generally read from the DMA’s dedicated data memory (DmaDmem) but can also address 
registers in the TX and RX ports, and the cache interface. 

Data moving through the DMA engine is stored in packet buffers while the DMA engine decides what to do 
with the packet and moves the data to the appropriate place. Imagine a packet that enters the chip on receive port 
1 destined for this node. The packet arrives on receive port 1 of the fabric link logic, passes through the switch to 
the DMA, and is stored in the block labeled “RX Port 1” until the DMA engine processes the packet. Each RX 
port can hold up to four such packets at a time (approx 80x64 bits) before it must use backpressure to prevent 
the switch from sending any more data. As packets arrive from the switch, the RX port wakes up the appropriate 
thread in the DMA microengine by asserting rxpX_ue_BufAvail so that the microengine can examine the packet 
and take appropriate action. Usually the microengine will decide to copy the packet to main memory at a particular 
address, and start a block transfer. The cache interface and receive port implement the block transfer and free up 
the packet buffer without any further interaction with the microengine. 

Data moving in the other direction, from this node to the fabric, travel through the transmit ports in a similar 
way. Packets are transferred from main memory to a particular transmit port, e.g. TX Port 2 if the packet is 
destined for transmit port 2 onto the fabric. Each TX port can hold up to four such packets at a time (approx. 
80x64 bits). When the transmit port raises txpX_ue_BufAvail, the microengine has a chance to decide how each 
packet should be handled. When the microengine is done, the transmit port sends packets out to the switch and 
recycles the packet buffer. 

One other port, called the Copy Port, is used to send packets from one application to another within the chip. 
The copy port is designed to act very much like a transmit or receive port, so that hardware structures can be 
reused and library software can treat local (within the chip) and remote packet transfers in a similar way. The 
copy port can be used to perform traditional DMA memory-to-memory copies. 

The microengine threads need to read and write L2 memory to manipulate queues and other data structures in 
memory. For this purpose, each microengine thread has a dedicated memory read buffer of 16 doublewords and a 
memory write buffer of 16 doublewords. The thread can schedule memory transfers into these buffers, wait until 
the transfer is complete, and manipulate the data. These buffers live in the Copy Port. 

To service the ports, the microengine has about ten concurrent threads which contend for resources when they 
have something to do. Most threads are associated with a switch port (or the copy “port”, or the queue manager). 
In addition, there is what might be thought of as a runt thread which has no preserved state, but which executes 
microinstructions to access datapath registers whenever an I/O reference to the DMA engine needs service. 


5.2.1 Top Level Block Diagram 


Here is a block diagram of the DMA engine that shows the major blocks and data buses. 


[Figure: DMA engine top-level block diagram. Blocks: RX Ports 0, 1, 2 (from Fabric Switch); TX Ports 0, 1, 2 (to Fabric Switch); Copy Port; ALU; Scratchpad DMem (1K x 72); Microengine Control; Cache Interface (RX Interface, TX Interface) attached to the Oddbound and Evenbound L2 Data Buses. Buses: ALU A In (64), ALU B In (64), ALU Result (72).] 
5.2.2 External Interfaces 


5.2.2.1 Fabric Switch to DMA receive port X (X=0,1,2) 


For each of the chip’s three RX links, the fabric switch forwards data to three corresponding RX ports in the 
DMA engine. The interface for data traveling from fabric switch to DMA is described below. When no packets are 
being received, the data wires can be used for the fabric switch to send status information. 


dma_fsw_RdyX_s1a: DMA is ready to accept another packet (X=port 0,1,2). 
When the FSW begins a packet that consumes the DMA’s 
last buffer, DMA deasserts dma_fsw_RdyX_s1a one cycle 
after the SoP. 


DatVal_s2a contains valid data (X=port 0,1,2) 


5.2.2.2 DMA transmit port X to Fabric Switch (X=0,1,2) 


The DMA engine has three transmit ports corresponding to the chip’s three transmit links. Each transmit port 
carries data to the fabric switch, which sends it to the appropriate link, using the interface described below. When 
there are no packets to transmit, the DMA can update the fabric switch’s control registers. 


fsw_dma_BufAvailX_s3a: FSW is ready to accept another packet (X=port 0,1,2). 
The DMA samples fsw_dma_BufAvailX_s3a each cycle in 
order to decide whether it can begin a packet in the next 
cycle. 


5.2.2.3 DMA to L2 Cache Switch 


See 7.2 in L2 Cache chapter. 

The DMA can start one CmdAddr transaction per cycle and one Data transaction per cycle onto the L2 cache 
switch buses. In each cycle it may request the even CmdAddr bus or the odd CmdAddr bus, but never both 
directions at once. Also, it may request the even Data bus or the odd Data bus, but never both directions at once. 
Meanwhile, the DMA can accept one incoming CmdAddr transaction and one Data transaction per cycle. 

The DMA engine can have up to four outstanding block reads and four outstanding block writes to the L2 
cache. In addition it responds to I/O reads and writes from the six processors. 


5.2.3 Module Hierarchy 


Before diving into the details of each component, here is a tree that shows how the DMA engine is organized 
into modules and submodules. 


• Dma: top level of DMA engine 

  - DmaUe: microengine control logic 
    * RAM containing microengine instructions 
  - DmaAlu: microengine ALU 
  - DmaDmem: microengine data memory 
  - DmaCif: L2 cache interface 
    * several queues to keep track of outstanding requests 
  - DmaRxp: contains the three RX ports that receive from fabric switch 
    * DmaRxpCtl0: RX port logic for port 0 
    * DmaRxpCtl1: RX port logic for port 1 
    * DmaRxpCtl2: RX port logic for port 2 
    * packet buffers 
  - DmaTxp: contains the three TX ports that transmit to fabric switch 
    * DmaTxpCtl0: TX port logic for port 0 
    * DmaTxpCtl1: TX port logic for port 1 
    * DmaTxpCtl2: TX port logic for port 2 
    * packet buffers 
  - DmaCopy: copy port, for memory-to-memory transfers 
    * packet buffers 
    * memory read buffers 
    * memory write buffers 


5.2.4 DmaUe: Microengine Control Logic 


The microengine is implemented with a four-stage pipeline, consisting of thread selection (C2), instruction 
decode (C3), ALU (C4), and write result (C5). 


[Figure: microengine pipeline. Stages: C2 Thread Select, C3 Instr Decode, C4 ALU, C5 Result. Blocks: Sleep Cond, Thread select, uInst mem (1024x72), PC, Dmem even (512x64), Dmem odd (512x64), Thread State, specialRegs. Signals: valid_c3, valid_c4, valid_c5; thread_c3, thread_c4, thread_c5; instr_c3, instr_c4, instr_c5; nextSleepCond_c3; nextAddr_c4; opA; opB; dest; aluResult_c5.] 


The following table describes the pipeline for the microengine in more detail. 
Stage | Name | Description 

C2 | Thread Select | Choose thread to run next using round robin scheme. Once a thread runs, it must wait a cycle before it can run again. Find the program counter for the selected thread. If the same thread is selected in C4, the next uPC is bypassed from C4 instead. The bypass allows us to avoid having a branch delay slot. 

C3 | Instruction Decode | Read microinstruction memory on rising edge. Decode instruction. Prepare to read operand memories/registers by driving OpaAddr and OpbAddr. 

C4 | ALU | All operands are read from respective memories on the rising edge of C4 and sent to the ALU. The ALU result is computed and registered at the end of C4. Compute the NextAddr for the thread, and bypass it back to C2 in case the same thread is selected again. Prepare to write results to memory by driving the ResultAddr and ResultData buses. 

C5 | Write Result | Write ALU result to selected memory on rising edge. Write changes to thread state registers. If necessary, ask the cache interface to start a memory transfer using the TaskStart interface. 
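
The C2 thread selection rule (round robin, with a one-cycle wait before a thread can run again) can be sketched in software as below. This is a behavioral illustration only; the class name, the rotating-pointer form of round robin, and the default of ten threads are assumptions rather than a description of the actual logic.

```python
class ThreadSelector:
    """Round-robin selection in which a thread cannot run in back-to-back cycles."""

    def __init__(self, num_threads: int = 10):
        self.num_threads = num_threads
        self.pointer = 0   # first thread to consider in the next cycle
        self.prev = None   # thread selected in the previous cycle, if any

    def select(self, ready):
        """Return the thread chosen this cycle, or None for an idle cycle.

        `ready` is the set of thread numbers that are awake.
        """
        chosen = None
        for step in range(self.num_threads):
            candidate = (self.pointer + step) % self.num_threads
            if candidate in ready and candidate != self.prev:
                chosen = candidate
                break
        if chosen is not None:
            self.pointer = (chosen + 1) % self.num_threads
        self.prev = chosen
        return chosen
```

With a single ready thread the selector alternates between that thread and an idle cycle, so the thread runs every two cycles; with two ready threads they simply alternate.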


FIXME: Document how I/O reads and writes get into the microengine and how it deals with them. 


5.2.5 DmaImem: Microengine Instruction Memory 


The DmaImem contains microinstructions that the DMA engine will execute. The instruction memory is 
initialized using WTIOs from a processor, while the DMA microengine is idle (all threads disabled). The data to 
be written flows through the datapath and ends up on alu_xxx_ResultDat_c5a<71:0>, which contains both data 
and ECC. If ImemFlipMemBits are used, the data can be intentionally corrupted before being written. 

Once the DMA threads are enabled, the microengine reads one instruction of DmaImem per cycle, does ECC 
correction, decodes the instruction, and executes it. It is unsafe for a processor to write Imem while the DMA 
threads are running. 


[Figure: DmaImem (1024x72). Read path: FinalPc_c2 (10 bits) drives RdAddr; RdData passes through ECC to the 64-bit ImemInst_c3. Write path: ResultAddr_c5 (10 bits) drives WrAddr; ResultData_c5 (72 bits) is XORed with ImemFlipMemBits (low 2 bits, from R_SDmaForceErr) to form WrData.] 

Implementation note: For timing reasons, the Imem is implemented as four banks of 256x72, interleaved on bits 
1 and 0 of the read address. In each read cycle, all four banks are read in late C2, and the correct result is selected 
based on address bits 1:0, which arrive from the ALU in early C3. 
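
The four-way interleave can be expressed as a simple address split. The sketch below assumes that the bank number is read address bits 1:0 and that the remaining eight bits index a row within the 256-entry bank; the function name is hypothetical.

```python
IMEM_WORDS = 1024  # 1024x72 instruction memory, four banks of 256x72


def imem_bank_and_row(read_addr: int):
    """Split a 10-bit Imem read address into (bank, row within bank)."""
    if not 0 <= read_addr < IMEM_WORDS:
        raise ValueError("Imem address out of range")
    bank = read_addr & 0x3   # address bits 1:0 select the bank
    row = read_addr >> 2     # address bits 9:2 index the 256-entry bank
    return bank, row
```

Because the row depends only on bits 9:2, all four banks can be read with the same row before bits 1:0 are known, which is the property the implementation note relies on.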
5.2.6 DmaAlu: Microengine ALU 


The microengine ALU is designed to calculate memory addresses and queue pointers. It also contains some 
general arithmetic such as add and subtract, booleans, etc. 


[Figure: DmaAlu datapath. Inputs Opa[63:0] and Opb[63:0]; OpbMux; Logic; Add/Sub with carry-in (cin) and carry injection at bits 36 and 32 (inj_b36, inj_b32) over the fields 63:36, 35:32, 31:16, 15:0; NMUX 4:1; Zero Detect; ZMUX; condition codes N and Z; ResultMux; NextAddr Calculation (alu_NextAddr); Flop Result; outputs ResultMemAddr and ResultMemLen.] 


5.2.7 DmaDmem: Microengine Data Memory 


The Dmem is the microengine’s scratchpad memory. It can be read and written by every instruction, and is 
also accessible to processors via I/O reads and writes. The Dmem is divided into four banks of 256 words by 64 bits 
each, and operands A and B can address the banks independently. Operands A and B can read different addresses 
from the four banks of DmaDmem. However, if the two operands try to access different addresses in the same bank 
in the same instruction, operation is undefined. The hardware simulation models will provide asserts to detect this 
condition. 


Since a thread may execute every two cycles, a potential data hazard exists between results written in C5 and 
operands read in C3 from the same address. The register file does not like to be written and read at the same 
address. To avoid the hazard, a bypass register allows ResultDat_c5 to be delayed until C6 and then driven onto 
the operand A or B data bus when the read and write addresses match. 


One bank of the DmaDmem is described in the diagram below. The Dmem is interleaved on bit numbers 
(DMEM_INTERLEAVE_BIT and DMEM_INTERLEAVE_BIT+1), presently 4 and 5. To produce addresses for a 
given Dmem bank, the interleave bits must be removed. 
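
Removing the interleave bits can be sketched as follows, together with the bank-conflict condition described earlier in this section. The names are hypothetical, and treating the lower interleave bit as the least significant bit of the bank number is an assumption.

```python
DMEM_INTERLEAVE_BIT = 4   # interleave bits are presently 4 and 5
DMEM_WORDS = 1024         # 1K-entry data memory, four banks of 256


def dmem_bank_and_row(addr: int):
    """Split a Dmem address into (bank, address within bank)."""
    if not 0 <= addr < DMEM_WORDS:
        raise ValueError("Dmem address out of range")
    bank = (addr >> DMEM_INTERLEAVE_BIT) & 0x3
    low = addr & ((1 << DMEM_INTERLEAVE_BIT) - 1)   # bits below the interleave
    high = addr >> (DMEM_INTERLEAVE_BIT + 2)        # bits above the interleave
    row = (high << DMEM_INTERLEAVE_BIT) | low       # interleave bits removed
    return bank, row


def operands_conflict(opa_addr: int, opb_addr: int) -> bool:
    """True when A and B name different addresses in the same bank,
    the case for which operation is undefined."""
    return (opa_addr != opb_addr
            and dmem_bank_and_row(opa_addr)[0] == dmem_bank_and_row(opb_addr)[0])
```

Under this split, addresses 0x00 and 0x40 fall in the same bank and would conflict, while 0x00 and 0x10 fall in different banks.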


[Figure: one bank of DmaDmem (Bank 0, 512x72). OpaAddr_c3 and OpbAddr_c3, with the interleave bits removed, drive RdAddr; RdData passes through ECC correction to the 64-bit OpaData_c4 and OpbData_c4. ResultAddr_c5 drives WrAddr and ResultData_c5 drives WrData, with DmemFlipMemBits (from R_SDmaForceErr) and a Bypass Reg. Address field labels: Process Index (0-13); Address<3:0>; Thread number (0-9).] 


We have allocated a 1K x 64 register file to hold control/status information (12 processes * 16 doublewords), 
(10 threads * 16 doublewords), and contexts (2 directions * 4 ports * 4 contexts * 16 doublewords). 


Each of the transmit and receive threads (including the “copy” instances of each) has four hardware contexts for 
which it is responsible; each such context consists of 16 doublewords which can be used as needed by microcode. 
The allocation is chosen to correspond closely with the structure of commands, so that a queue entry can be loaded 
directly into the context memory. 


The current assignment for transmit contexts is:

Figure 5.1: DMem Process Variables (0-255)
Figure 5.2: DMem Thread Variables (256-511)
Figure 5.3: DMem Tx Context Variables (512-767)
Figure 5.4: DMem Rdt/Bdt Cache (768-1023)

[Transmit context table, partly illegible. Recoverable entries: word 0, bits 7:0, command: DMA command; word 0, bits 63:8, header: RDT data to be put in header FORD; buffer handle: BDT data combined with offset; BDT index.]


For receive contexts, the assignment is: 


[Receive context table, partly illegible. Recoverable entries: bits 63:32: context id, local process index; remaining segment length; notifier payload: software payload for notifier message.]


5.2.8 DmaRxp: Receive Ports


The DmaRxp module contains three instances of the receive port, connected to the three data ports coming 
from the fabric switch. 


A DMA receive port queues packets as they come from the fabric switch, one doubleword (64 bits) per cycle. The 
header, control, and trailer FORDs are captured into one register file (Oprf), while other doublewords are stored 
in a packet buffer (Pbuf) which can be quickly dumped to memory through the cache interface, eight doublewords 
(512 bits) per cycle. The DMA microengine decides what should be done with the packet: either throw it away 
or schedule it to be transferred to main memory. The Pbuf can store DMA_PBUF_N different packets (presently 
4) before it must tell the switch to hold off until another buffer is available. Both Pbuf and Oprf are readable on 
operand B at a rate of 64 bits per cycle. 


CAUTION: The receive port control logic uses uncorrected data from the fabric switch in several cases. The 
uncorrected HasCtrl bit in the header is used to determine whether the second FORD is to be treated as a control
FORD or payload. The uncorrected ProcessIndex in the EoP is sampled into registers, which are retimed into the
cclk domain and drive the rxpN_ue_ProcessIndex ports. In February 2006, we decided that using the uncorrected 
data was acceptable because of the way the fabric switch drives data to DMA. The FswDmao block corrects and 
generates new ECC just before driving 72 bits of data to the DMA engine, so the DMA will always see good ECC 
coming from the FSW (barring logic or interconnect problems of course). Because the ECC is known to be correct, 
we will continue to use uncorrected bits for the purposes of HasCtrl and ProcessIndex only. The HLM contains 
assertions that complain if these bits are ever corrupted (by doing an ECC correction and checking which bit was 
flipped) so that we will know if this condition ever occurs. See Bug1143 and Bug1160. 


[Figure: Dma Receive Port Block Diagram, spanning the sclk and cclk domains. Labels: Address Sequencer; EoP; WrData and WrAddr into Pbuf (Packet Buffer, array of flops, 8x72 x 8 banks) and Oprf (Operand Regfile, 1R,1W, 44x72); UeCifPending<3:0>; Buffer State Machine (x4); RdData0 and RdData1 (144 bits) to MemOutData (to Cache Interface); RdData through ECC correct (64 bits) to Operand B Data Out.]


The Pbuf is organized as DMA_PBUF_N different buffers of DMA_PBUF_WORDS words. For DMA_PBUF_N=4,
the address into the register file looks like:

bits 6:2 | bits 1:0
offset within buffer | buffer number


The maximum offset is not a power of two, but it’s easy to make the maximum buffer number a power of two.
We chose DMA_PBUF_N = 4. By putting the buffer number in the low order bits, we can populate as many offsets
as we wish without wasting memory. If DMA_PBUF_N=4 and DMA_PBUF_WORDS=19, the memory size is 19*4
= 76 words. The Pbuf must be implemented in a way that supports 128-bit reads of offset N and offset (N+1) for
even N.
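The addressing scheme above can be illustrated with a small sketch. It assumes the buffer number occupies bits 1:0 and the offset occupies the bits above it, as the paragraph describes; the function name is hypothetical, and DMA_PBUF_WORDS = 19 is simply the example value used in the text.

```python
DMA_PBUF_N = 4       # number of packet buffers (power of two)
DMA_PBUF_WORDS = 19  # example value from the text, not a power of two


def pbuf_address(buf, offset):
    """Pack (buffer number, word offset) into a Pbuf register-file address."""
    if not 0 <= buf < DMA_PBUF_N:
        raise ValueError("buffer number out of range")
    if not 0 <= offset < DMA_PBUF_WORDS:
        raise ValueError("offset out of range")
    return (offset << 2) | buf  # buffer number in the low-order two bits


# Highest address in use, plus one, gives the memory size with no holes.
pbuf_size_words = pbuf_address(DMA_PBUF_N - 1, DMA_PBUF_WORDS - 1) + 1
```

Because the non-power-of-two quantity (the offset) sits in the high-order bits, every address from 0 up to the maximum is used, which is why the size comes out to exactly 19*4 = 76 words.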


The Oprf is used for two purposes: several words are used to store the header FORD, control word, trailer word,
etc. for each packet. In another part of the Oprf, we store status information from the fabric switch. The status
information can be read through the operand bus so that software running in the cores can access it. The physical
organization of Oprf is:


Oprf address | Description
0x00 - 0x1F | Switch status information
0x20 - 0x2F | RX port control registers for each buffer


For addresses 0x20-0x2F, the Oprf address decodes as follows:

bit 5 | bit 4 | bits 3:2 | bits 1:0
[field values illegible]

Reg number definitions for Oprf bits 3:2 are:

[table illegible; one entry reads "unused"]


The receive port contains a buffer state machine for each of the DMA_PBUF_N packets in the buffer. Each 
buffer state machine is independent of the others, except that only one buffer may be in state UE at a time. The 
state diagram for each buffer is shown in the figure below.

Dma Receive Port Buffer States

[State diagram. States: SWRX (buffer ready to receive data), WAITUE (waiting for microengine), UE (microengine operating), CA (waiting for cache op). Transition labels: "EoP and not (ueAvailable and uePktSel=me)", "EoP and (ueAvailable and uePktSel=me)", "ueAvailable and uePktSel=me", BufTransfer, BufSkip, MemOutLast.]

Notes:
ueAvailable is 1 when no buffer is in the UE state, or 0 otherwise.
uePktSel is the number of the buffer which will enter the UE state next.
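The per-buffer state machine can be sketched as below. The two EoP transitions out of SWRX are stated in the text. The remaining arcs (WAITUE to UE on "ueAvailable and uePktSel=me", UE to CA on BufTransfer, UE to SWRX on BufSkip, CA to SWRX on MemOutLast) are inferred from the labels in the state diagram and should be read as assumptions. The Python names are illustrative.

```python
from enum import Enum


class BufState(Enum):
    SWRX = "buffer ready to receive data"
    WAITUE = "waiting for microengine"
    UE = "microengine operating"
    CA = "waiting for cache op"


def next_state(state, eop=False, ue_available=False, pkt_sel_me=False,
               buf_transfer=False, buf_skip=False, mem_out_last=False):
    """Return the next state of one receive buffer for one cclk cycle."""
    grant = ue_available and pkt_sel_me
    if state is BufState.SWRX and eop:
        # Stated in the text: EoP moves SWRX to either UE or WAITUE.
        return BufState.UE if grant else BufState.WAITUE
    if state is BufState.WAITUE and grant:
        return BufState.UE           # assumed arc
    if state is BufState.UE and buf_transfer:
        return BufState.CA           # assumed arc: packet goes to memory
    if state is BufState.UE and buf_skip:
        return BufState.SWRX         # assumed arc: packet is thrown away
    if state is BufState.CA and mem_out_last:
        return BufState.SWRX         # assumed arc: cache transfer finished
    return state
```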


The DmaRxp module spans two clock domains, sclk (switch clock) and cclk (core clock). The data arrives in 
sclk time, and is written into the Pbuf on sclk edges. When a packet is completely transferred, the microengine 
and cache controller (running on cclk) read the data when it is known to be stable. A 4-state FSM per buffer keeps 
track of which buffer is being used in which way. The register file is the primary means of synchronizing data across 
domains, but several control signals need to pass across clock domains using synchronizers. 

EoP (sclk to cclk) is produced by the switch when it sends the last doubleword in a packet. When EoP comes
from the switch, it is a one-cycle pulse in sclk. This passes through a pulse synchronizer (see footnote 1) and becomes a
one-cycle pulse in cclk. In the cclk domain, EoP tells the state machine that a buffer is completely transferred and ready to
be used by the microengine. EoP causes the state transition from ST_SWRX to either ST_WAITUE or ST_UE.
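The toggle-based pulse synchronizer described in the footnote can be modelled behaviourally as follows. This is a sketch under assumptions: a two-flop synchronizer in the destination domain and a class and method naming of my own choosing; it does not model metastability or real clock ratios.

```python
class PulseSynchronizer:
    """Behavioural model: source pulse -> toggle -> synchronizer -> edge detect."""

    def __init__(self):
        self.toggle = 0     # flop in the source clock domain
        self.sync = [0, 0]  # assumed two-flop synchronizer in the destination domain
        self.prev = 0       # previous synchronized value, for edge detection

    def src_tick(self, pulse):
        """One source-clock edge; a one-cycle pulse flips the toggle."""
        if pulse:
            self.toggle ^= 1

    def dst_tick(self):
        """One destination-clock edge; returns True for exactly one cycle per toggle."""
        s0, s1 = self.sync
        self.prev = s1
        self.sync = [self.toggle, s0]
        return self.sync[1] != self.prev  # any transition becomes a pulse
```

Sending the toggle level rather than the pulse itself means the destination domain cannot miss the event even if its clock is slower than the source clock, provided source pulses are spaced far enough apart.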

UeCifPending<3:0> (cclk to sclk) is a bit vector produced in the cclk domain that tells the sclk logic whether
a buffer is in use (by microengine or cache interface) or ready to receive another packet. Individual bits of
UeCifPending are set by the arrival of EoP and cleared when the microengine and cache interface are done with the
packet. To address the dangers of sending a 4-bit bus through separate synchronizers, the bits are sent using a
module BusSyncOneWay which implements a handshake protocol and resamples into the destination clock domain
in a safe way.


5.2.9 DmaTxp: Transmit Ports 


The transmit ports are similar to the receive ports except that the data flows from the L2 cache to the DMA
transmit port to the fabric switch. The microengine can either ask the cache interface to write a packet’s payload
into a packet buffer, or write it directly via the Result data bus. The microengine also writes the header, control,
trailer FORDs, and the payload length into registers in the Oprf. Then the address sequencer in the transmit port
takes over, and sends a packet to the fabric switch, 64 data + 8 ECC bits per cycle.

Footnote 1: A one-cycle pulse in the source clock domain generates a toggle signal which is passed through a synchronizer. When any transition
is detected in the destination clock domain, it is turned into a one-cycle pulse in the destination clock domain.
[Figure: Dma Transmit Port block diagram, spanning the sclk and cclk domains. Labels: 72-bit output to Fabric Switch; Address Sequencer; TxDone; RdData from Pbuf (Packet Buffer, array of flops, 8x72 x 8 banks) and Oprf (Operand Regfile, 1R,1W, 44x72); UeCifPending<3:0>; Buffer State Machine (x4); WrData0 and WrData1 (144 bits) from MemInData (from Cache Interface); WrData (72 bits) from ResultData; TxpFlipMemBits (2 bits), xor'ed into the low 2 bits of each 72-bit word.]


Unlike the receive port, the transmit port may need to read packets from memory which are not aligned in a 
convenient way. The packet payload may start on any 32-bit boundary (memory address is a multiple of 4). To 
handle unaligned packets, the Pbuf is large enough to hold 3 cache blocks per packet, and a 32-bit alignment mux 
is placed on the path to the fabric switch. If the packet payload is aligned, only two cache blocks are needed and 
the data is driven to the fabric switch starting at address 0. But in unaligned cases, the DMA cache interface may 
need to read three cache blocks into the Pbuf, knowing that some bits will not be used, and then read out just the 
relevant data. 
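The "two blocks when aligned, three when unaligned" rule can be checked with a short calculation. This sketch assumes a 64-byte cache block and uses a 128-byte payload as the example; neither number is stated in this paragraph, so both are assumptions, as is the function name.

```python
CACHE_BLOCK_BYTES = 64  # assumption: cache block size in bytes


def cache_blocks_spanned(start_addr, length_bytes, block=CACHE_BLOCK_BYTES):
    """Count the cache blocks touched by a payload starting at start_addr."""
    if start_addr % 4 != 0:
        raise ValueError("payload must start on a 32-bit boundary")
    if length_bytes <= 0:
        return 0
    first = start_addr // block
    last = (start_addr + length_bytes - 1) // block
    return last - first + 1
```

Under these assumptions a 128-byte payload at address 0 touches two blocks, while the same payload starting at address 4 touches three, which matches the Pbuf being sized for three cache blocks per packet.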


5.2.10 DmaCopy: Copy Port 


The copy port is used when sending packets to destinations within the node. Packets are loaded into a
packet buffer from the source address and then written back to memory at the destination address. The microengine 
treats the copy port as a transmit port and a receive port, and in fact the transmit and receive functions of the 
copy port are managed by separate threads. 

Like the transmit port, the copy port needs to be able to read packet payloads at various alignments, down to
any 32-bit boundary. Data is written to the Pbuf aligned exactly as it is in main memory, then realigned properly
by the cache block realignment module as it is read out.

The copy port also contains a memory read/write buffer (RWMB), which gives the microengine a way to read
and write cache lines to/from the memory system directly. The RWMB is 10 threads * 16 doublewords = 160
words by 64 bits, plus 6 extra words to assist I/O operations to/from the 6 processors.

[Figure: Copy Port block diagram. Labels: Operand B Data (64 bits); Result Data (72 bits); ECC correct; Oprf (uEngine Operands); Memory Read/Write Buffer (flops); Pbuf (Packet Buffer, array of flops); address sequencers driving RdAddr and WrAddr; Cache Block Realign; Buffer State Machine (x4); CopyFlipMemBits<1:0>, xor'ed into the low 2 bits; MemOutData (to Cache Interface, 144 bits); MemInData (from Cache Interface, 144 bits). Size labels that survive: 12x72 x 8 banks, 12x64, 26x72 x 8 banks.]


5.2.11 DmaCif: Cache Interface 


The cache interface manages transfers between the L2 Cache Memory Bus and buffers inside the DMA block. 
The DmaCif handles the details of the L2 memory bus protocol: requesting the CmdAddr bus and the Data buses 
and handling I/O reads and writes from the processors. Each microengine thread can start memory transfers 
or “tasks” via the TaskStart interface and optionally wait for its memory transfers to complete. The TaskStart 
interface determines the memory address and length of transfer by copying the MemAddr and MemLen register 
value for the requesting thread. (Exception: some transfers specify a fixed length so the MemLen value is not 
always used). Tasks are placed in queues where they wait for their turn to use the CmdAddr or Data bus. Memory 
transfers move data between main memory and the TX, RX, and Copy port buffers in the DMA engine by driving 
the MemIn and MemOut interfaces. MemIn controls data moving from main memory into DMA buffers. MemOut 
controls data moving from the DMA buffers out to main memory. As I/O reads and writes arrive from the CSW, 
they are sent to the microengine via the StartIo interface.

The per-thread counters and per-port counters keep track of how many requests are waiting in queues or
outstanding read/write tables, so that the DmaCif can notify the interested parties when the requests that affect
them are finished. When a task is first processed in TaskStart, the per-thread and per-port counters are incremented. In
the queues, each task is tagged with the microengine thread number that initiated it, so that the correct counter
can be decremented when the task completes. The outputs of the thread and port counters are TaskPending
and RefCntZero. TaskPending<9:0> tells the microengine which threads have outstanding memory requests.
One RefCntZero signal goes from the cache interface to each port (rx, tx, copy) telling it whether there are any
outstanding memory requests.
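The relationship between the per-thread counters and the TaskPending bit vector can be sketched as follows. The class and method names are hypothetical; only the increment-on-start, decrement-on-completion behaviour and the one-bit-per-thread output come from the text.

```python
class ThreadCounters:
    """Per-thread outstanding-task counters with a TaskPending-style bitmask."""

    def __init__(self, n_threads=10):
        self.count = [0] * n_threads

    def task_start(self, thread):
        """Called when a task from this thread is first processed in TaskStart."""
        self.count[thread] += 1

    def task_done(self, thread):
        """Called when a task tagged with this thread number completes."""
        if self.count[thread] == 0:
            raise RuntimeError("completion without a matching start")
        self.count[thread] -= 1

    def task_pending(self):
        """Bit t is 1 while thread t has at least one outstanding request."""
        return sum(1 << t for t, c in enumerate(self.count) if c > 0)
```

A thread that chooses to sleep until its memory operations finish would, in this model, wake when its bit in `task_pending()` returns to 0.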

A block diagram of the DmaCif is below. 


[Figure: DmaCif block diagram. Labels: TaskStart Interface; per thread MemAddr; Interpret CmdAddr; per thread counters; per port counters; Interpret Data; MemIn requests (to DMA buffers); MemOut requests (to DMA buffers); Write DMA buffers; WtioArrival; request queues; Drive L2 CmdAddr (to CSW); Drive L2 Data; CmdAddrReq/Gnt; DataReq/Gnt; StartIo Interface; StartIo (to uEngine); RdyForStartIo (from uEngine); Outstanding Read Table (ORT); Outstanding Write Table (OWT).]

For each type of traffic that passes through the DmaCif, the next few paragraphs describe what path the
requests follow. The traffic types are:

Symbol | Traffic type
BRD | Block read
BWT | Block write
RDIO | I/O read
WTIO | I/O write
SPCL | Special command

Block Read 

The microengine drives the TaskStart interface, and the request is placed in ReadWriteQ. The request cannot 
leave ReadWriteQ until an ORT entry is available; the number of the selected ORT entry (0-3) determines which of 
the DMA’s transaction IDs will be used. When the request comes out of the queue, the DmaCif arbitrates for the 
CmdAddr bus in the appropriate direction and drives a BRD command onto the bus. The ORT entry is written
with the details of this block read request, so that we know how to handle the data when it arrives. If more cache
blocks are required to finish the request, the next cache block request is written into ReadWriteExtQ. Wait for 
data to be returned or for a PRBNOHIT. 

If a PRBNOHIT is seen on the CmdAddr bus, the Interpret CmdAddr block looks up the entry in the ORT to 
find the address, then places a BRDR request into the BrdrQ. 

When data arrives from the CSW, it enters the Interpret Data block, which uses the TID to find the corresponding
ORT entry. From the ORT we know which of the DMA buffers will be written and the starting address. The
DmaCif drives the MemIn interface to tell the DMA buffer to write, and data from CSW flows into the buffer. The 
Interpret Data block also places a PRBDONE request into the DataRspQ if needed (only if DataOrigin indicates 
that the data did not come from a coherence controller.) 

When a microengine thread starts a read operation, the cache interface increments a per-thread counter by
one. When the transfer is finished (the data is ready to be used by microcode), the counter decrements by one.
If the thread decides to sleep until memory operations are done, this per-thread counter controls when the thread
wakes up.
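The role of the ORT in a block read can be sketched as a small table whose slot number doubles as the transaction ID. The four-entry size follows the "ORT entry (0-3)" wording above; the class, the method names, and the stored fields are illustrative assumptions.

```python
class OutstandingReadTable:
    """Toy model of the ORT: the slot index is used as the transaction ID."""

    def __init__(self, size=4):
        self.entries = [None] * size

    def allocate(self, target_buffer, buffer_addr):
        """Claim a free slot for a BRD; return its TID, or None if the table is full."""
        for tid, entry in enumerate(self.entries):
            if entry is None:
                self.entries[tid] = (target_buffer, buffer_addr)
                return tid
        return None  # the request must stay in its queue

    def complete(self, tid):
        """Data returned for this TID: report where it goes and free the slot."""
        entry = self.entries[tid]
        if entry is None:
            raise KeyError("no outstanding read for TID %d" % tid)
        self.entries[tid] = None
        return entry
```

When `allocate` returns None the read simply waits at the head of its queue, which mirrors the rule that a request cannot leave ReadWriteQ until an ORT entry is available.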

Block Write 

The microengine drives the TaskStart interface, and the request is placed in ReadWriteQ. The request cannot 
leave ReadWriteQ until an OWT entry is available; the number of the selected OWT entry (0-3) determines which 
of the DMA’s transaction IDs will be used. When the request comes out of the queue, the DmaCif arbitrates for 
the CmdAddr bus in the appropriate direction and drives a BWT command onto the bus. The OWT entry is 
written with the details of this block write request, so that we know what to do when the “go” command arrives. If 
more cache blocks are required to finish the request, the next cache block request is written into WriteExtQ. Wait 
for a BWTGO, BWTNOHIT, or PRBINV command.

When the BWTGO, BWTNOHIT, or PRBINV command arrives from the CSW, it enters the Interpret Command
block, which uses the TID to find the corresponding OWT entry. From the OWT we know which of the
DMA buffers will be sent to memory and the starting address. The DmaCif drives the MemOut interface to tell 
the DMA buffer to send, and data from the buffer flows into the CSW. For BWTNOHIT or PRBINV, the data is 
sent to the even or odd coherence controller; for BWTGO the data is sent to the module that sent the BWTGO 
based on CmdAddrOrigin. 

When a microengine thread starts a write operation, the cache interface increments a per-thread counter by
one. When the transfer is finished (the data has been sent to the CSW), the counter decrements by one. If the
thread decides to sleep until memory operations are done, this per-thread counter controls when the thread wakes
up.

I/O Read 

A RDIO command arrives from the CSW and enters the Interpret CmdAddr block. The request is placed in the
StartIoQ where it waits to enter the StartIo interface. Eventually it reaches the head of the queue and is driven to the
microengine. The microengine completes the I/O read operation and puts the result into a known location in the
Copy Port’s Write Memory Buffer. Then the microengine drives the TaskStart interface to ask DmaCif to respond
to the I/O read. The request is placed in the DataRdioQ; the DmaCif then arbitrates for the Data bus and drives MemOut
to read the data from the copy port. Finally the data moves from the copy port to the CSW to complete the I/O read
operation.

The per-thread counters are not affected by I/O reads. 

NOTE: The DMA contains bug1991 in which RDIO can be corrupted by a WTIO following it from the same 
core. See 8 for details. 

I/O Write 

A WTIO command arrives from the CSW and enters the Interpret CmdAddr block. Three things happen:
1) an RDIO request is placed in CmdRdioQ, 2) the details of the WTIO command are placed in StartIoQ, and
3) a bit is cleared in WtioDataReady<5:0>, a bitmask that records whether the write data has arrived or not.
WtioDataReady is indexed by core number. The RDIO is not allowed to issue from the CmdRdioQ until the
WTIO has reached the head of the StartIoQ. When the WTIO reaches the head of the StartIoQ, the RDIO goes
out onto CmdAddr to the processor. Then we must wait for the core to send data.

When the data arrives, it enters the Interpret Data block, which uses the TID to know which core sent the
data. Knowing the core number, we know where the write data is supposed to go. The DmaCif sends a MemIn
request to the copy port to put the I/O write data into the Memory Read Buffer in the copy port. The Interpret
Data block also sets WtioDataReady<corenum> so that the WTIO is allowed to issue from the StartIoQ.
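The gating of a WTIO on its write data can be sketched as below. This toy model covers only the StartIoQ and the WtioDataReady bitmask; the RDIO that is sent back to the core from CmdRdioQ is left out. The class and method names are hypothetical, and the assumption that all bits start out set is mine.

```python
from collections import deque


class WtioGate:
    """Toy model of WTIO ordering: issue from StartIoQ only after data arrives."""

    def __init__(self, n_cores=6):
        self.wtio_data_ready = (1 << n_cores) - 1  # assumption: all bits set at reset
        self.start_io_q = deque()

    def wtio_command(self, core, addr):
        """WTIO seen on CmdAddr: clear the core's bit and queue the command."""
        self.wtio_data_ready &= ~(1 << core)
        self.start_io_q.append((core, addr))

    def wtio_data(self, core):
        """Write data seen on the Data bus: mark it as arrived."""
        self.wtio_data_ready |= 1 << core

    def try_issue(self):
        """Issue the head of StartIoQ to the microengine if its data has arrived."""
        if not self.start_io_q:
            return None
        core, _addr = self.start_io_q[0]
        if not (self.wtio_data_ready >> core) & 1:
            return None  # head of queue waits for its write data
        return self.start_io_q.popleft()
```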

Finally, the WTIO request issues from the StartIoQ and is sent to the microengine. The microengine completes
the I/O write operation by reading from the copy port and writing the data into the memory selected by the address
from the CmdAddr cycle.

The per-thread counters are not affected by I/O writes.


NOTE: The DMA contains bug1991 in which RDIO can be corrupted by a WTIO following it from the same 
core. See 8 for details. 


SPCL (Special) Command 


A SPCL command is treated like an I/O Read because it is triggered by just a CmdAddr cycle. The SPCL
arrives from the CSW and enters the Interpret CmdAddr block. The request is placed in the StartIoQ where it
waits to enter the StartIo interface. Eventually it reaches the head of the queue and is driven to the microengine. The
microengine completes the SPCL operation, then drives the TaskStart interface to ask DmaCif to respond to the
SPCL. The request is placed in the SpclIntQ; the DmaCif then arbitrates for the CmdAddr bus and sends the DONE command
back to the core.


The per-thread counters are not affected by SPCL commands.


Interrupts


Microcode causes an interrupt by setting the memOp field to “sendIntr” and placing 16 bits of interrupt data
on the ALU result. ALU result bits 15:12 are the bus stop number to deliver to, and ALU result bits 11:0 are the
unique number that tells the processor which interrupt fired. The INTR command is placed in the SpclIntQ. When
it reaches the head of the queue, the DmaCif arbitrates for the CmdAddr bus and sends the INTR command to the core.
There is no response.


The INTR operation increments the DmaCif’s thread counter by one. When the interrupt has been sent on the 
CmdAddr bus, the thread counter decrements again. 
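The 16-bit interrupt data layout described above (bus stop number in bits 15:12, interrupt number in bits 11:0) can be illustrated with a pair of helper functions. The function names are hypothetical; only the bit positions come from the text.

```python
def encode_interrupt(bus_stop, intr_number):
    """Pack the 16-bit interrupt data placed on the ALU result."""
    if not 0 <= bus_stop <= 0xF:
        raise ValueError("bus stop number must fit in 4 bits")
    if not 0 <= intr_number <= 0xFFF:
        raise ValueError("interrupt number must fit in 12 bits")
    return (bus_stop << 12) | intr_number


def decode_interrupt(data):
    """Split 16-bit interrupt data into (bus_stop, intr_number)."""
    return (data >> 12) & 0xF, data & 0xFFF
```

For example, interrupt number 0x02A delivered to bus stop 3 is encoded as 0x302A.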


5.2.11.1 Cache Interface Queues 


For each of the queues in the block diagram, the table below tells the size, data representation, and what types 
of commands would use the queue. 


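Here is the table for ICE9:

Queue | Data representation | Size | Commands | Notes
[illegible] | [illegible] | [illegible] | [illegible] | 10 threads * 2 reqs per thread
[illegible] | [illegible] | [illegible] | [illegible] | 10 threads * 2 reqs per thread
CmdRdioQ | DmaCifProtocolEntry | 6 | RDIO | 6 cores
[illegible] | [illegible] | [illegible] | [illegible] | [illegible] outstanding reads
[illegible] | [illegible] | [illegible] | [illegible] | 6 SPCL response + 10 INTRs
[illegible] | [illegible] | [illegible] | [illegible] | outstanding writes
DataRdioQ | DmaCifProtocolEntry | 6 | RDIO | for 6 cores
[illegible] | [illegible] | [illegible] | [illegible] | [illegible] outstanding writes
[illegible] | [illegible] | [illegible] | [illegible] | 6 cores * (1 read or SPCL + 1 write)


In Twice9 the number of outstanding reads and writes changed, and the number of cores changed.

Queue | Data representation | Size | Commands | Notes
[illegible] | [illegible] | [illegible] | [illegible] | 10 threads * 2 reqs per thread
[illegible] | [illegible] | [illegible] | [illegible] | 10 threads * 2 reqs per thread
[illegible] | [illegible] | [illegible] | [illegible] | 10 cores
[illegible] | [illegible] | [illegible] | [illegible] | 7 outstanding reads
[illegible] | [illegible] | [illegible] | [illegible] | 10 SPCL response + 10 INTRs
DataRspQ | DmaCifProtocolEntry | 7 | PRBDONE | [illegible] outstanding writes
[illegible] | [illegible] | [illegible] | [illegible] | for 10 cores
[illegible] | [illegible] | [illegible] | BWT | outstanding writes
StartIoQ | DmaCifStartIoEntry | 20 | WTIO, RDIO | 10 cores * (1 read or SPCL + 1 write)


5.2.11.2 Interfaces in DmaCif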


Interface | Description
TaskStart | The microengine asks the cache interface to start a data transfer using the TaskStart interface.
StartIo | The cache interface notifies the microengine that an I/O read or write has occurred. The microengine sends back a status signal that tells when another request can be sent.
MemOut, MemIn | The MemOut bus carries data from packet buffers out of the DMA buffers to L2 memory. MemOut is connected to the three receive ports and the RX side of the copy port. The MemIn bus carries data from memory into the DMA packet buffers. MemIn is connected to the three transmit ports and the TX side of the copy port.
L2 Cache Interface | The DMA can arbitrate and write commands, then arbitrate and write data onto the L2 Cache memory bus. The CSW can carry data in either direction. When other blocks write to the DMA via the memory bus, the cache switch hands the DMA one CmdAddr value and one Data value per cycle.


5.2.11.3 TaskStart Interface (Microengine to DmaCif)




The microengine requests memory transfers using the TaskStart interface of the cache interface, described in
Table 5.1. The timing of the TaskStart interface signals is described in Figure 5.5.


The cache interface contains four queues which record memory transaction requests. When the microengine 
requests a transfer (raises TaskStart), that thread’s current MemAddr is placed into one of the queues along with 
all the parameters of the transfer. Read and write tasks are placed into separate queues, so that writes cannot 
get stuck behind reads and vice versa. Some memory transfers are several cache lines long and must be done 
in several steps. As a step is completed, if there is more to be transferred, a new task is placed at the tail of 
another queue, called the “extended” queue. So, the cache interface contains a total of 4 queues: WriteQueue, 
WriteExtendedQueue, ReadQueue, and ReadExtendedQueue. The cache interface will pick the operation at the 
head of one of the queues and work on it until completion, then pick another in the next cycle. 
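The way a multi-cache-line transfer moves from a primary queue to its "extended" queue can be sketched as follows. Only the read side is modelled. The rule used here for choosing between the two queues (drain the primary queue first) and the 64-byte cache block size are assumptions made for the example; the text does not state the selection policy.

```python
from collections import deque

CACHE_BLOCK_BYTES = 64  # assumption: cache block size in bytes


def issue_order(tasks):
    """Return the order in which (thread, address) cache block reads are issued.

    Each task is (thread, start_addr, n_blocks). After one block of a task is
    issued, the remainder is re-queued at the tail of the extended queue.
    """
    read_q = deque(tasks)
    read_ext_q = deque()
    issued = []
    while read_q or read_ext_q:
        q = read_q if read_q else read_ext_q  # assumed policy: primary queue first
        thread, addr, blocks = q.popleft()
        issued.append((thread, addr))
        if blocks > 1:
            read_ext_q.append((thread, addr + CACHE_BLOCK_BYTES, blocks - 1))
    return issued
```

Re-queuing the remainder at the tail lets a short transfer from another thread issue between the steps of a long one, instead of waiting for the whole long transfer to finish.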


5.2.11.4 StartIo Interface (DmaCif to microengine)

Table 5.1: TaskStart Interface from Microengine to Cache Interface

Signal | From | To | Cycle | Description
TaskStart | [illegible] | Cif | C5 | When asserted, causes a memory transfer to start.
TaskThread<2:0> | [illegible] | Cif | [illegible] | Tells which microengine thread has started a memory transfer. Each thread has its own MemAddr value, so TaskThread tells which value to use. Also, the cache interface keeps record of how many transfers are pending per thread and reports back to the microengine.
TaskTarget<5:0> | [illegible] | [illegible] | [illegible] | Which of the DMA’s memories will be accessed.
TaskType<1:0> | [illegible] | [illegible] | [illegible] | Is it a read or a write operation, and what kind? The types are cacheline read, cacheline write, I/O read response, and I/O write response. See 5.5.31 for encoding.
TaskTid<5:0> | Ue | Cif | C5 | Transaction ID of the task. This is only [remainder illegible]
TaskOrigin<5:0> | [illegible] | [illegible] | [illegible] | CSW bus stop number of the core that [remainder illegible]
rxp0_cif_UeBufNum | Rxp0 | Cif | C5 | Receive port 0 tells the cache interface which packet buffer number the microengine is working on.
rxp1_cif_UeBufNum | Rxp1 | Cif | C5 | What buffer is UE working on, in receive port 1.
rxp2_cif_UeBufNum | Rxp2 | Cif | C5 | What buffer is UE working on, in receive port 2.
TaskPending<7:0> | Cif | Ue | C6 | Bitmask per microengine thread which tells whether there is 1 or more memory transfer in progress.
TaskFull<7:0> | Cif | Ue | C6 | Bitmask per microengine thread which tells whether a thread has already launched the maximum number of memory operations.


Table 5.2: StartIo Interface from Cache Interface to Microengine

Signal | From | To | Cycle | Description
StartIo | [illegible] | [illegible] | C1 | When 1, trigger an I/O read or write microinstruction based on the values on the StartIo signals in the same cycle.
StartIoType | cif | ue | C1 | Type of I/O operation. 0=read, 1=write.
StartIoAddr<15:0> | cif | ue | C1 | Address of I/O operation. The format is the same as DmaBusAddr, consisting of a unit field and an offset field.
StartIoTid<4:0> | [illegible] | [illegible] | [illegible] | CSW transaction ID of the I/O operation.
StartIoOrigin<3:0> | cif | ue | C1 | CSW bus stop number of the core that sent this I/O operation.
RdyForStartIo | ue | cif | C3 | Microengine asserts this whenever it is ready to receive a StartIo operation. The CIF should never raise StartIo unless RdyForStartIo is asserted.


5.2.11.5 Interface to L2 Cache 


The cache interface performs four basic types of memory operations: read cache line from memory, write cache 
line to memory, respond to I/O write from core, and respond to I/O read from core. When reading cache lines, the 
DMA engine arbitrates for and writes the CmdAddr bus for one cycle to request data from memory. The response 
may come back many cycles later, so the details of that request are stored in the OutstandingReadTable (ORT). 
When the response arrives on the incoming Data bus, the OutstandingReadTable tells where the data should be 
sent within the DMA engine, e.g. transmit port 2 packet buffer at address 0x18. When the data is safely in the 
packet buffer, the ORT entry is freed so that it can be reused. We support up to 4 outstanding reads at a time. 
When writing cache lines, the DMA engine arbitrates for and writes the CmdAddr for one cycle, then when a 
BWTGO comes back, it reads data from the selected internal memory, then arbitrates for and writes the Data bus 
for four cycles. 


Unlike cache line transfers, I/O reads and writes from the cores may arrive at any time in any order. Each core
may have a maximum of one I/O request outstanding (as of 4/25/2005), so the cache interface needs a place to
store six I/O requests between arrival and completion. For I/O writes, the request arrives on the CmdAddr bus
and gets stored in the StartIoQ. The DMA sends a RDIO back to the core, and when the data to be written arrives
on the Data bus, it is written to a location in the copy port. The cache interface asks the microengine to execute
a special I/O write instruction which reads the data out of the copy port and writes it to the appropriate place
inside the DMA engine based on I/O address. I/O reads are implemented in a similar way. The request arrives on
the CmdAddr bus and gets stored in the StartIoQ. The cache interface asks the microengine to execute a special
I/O instruction which reads the appropriate register or memory and writes the result to the copy port. Then the
cache interface arbitrates for and writes the response data from the copy port to the Data bus. The core must not
perform more than one outstanding I/O request at a time, or the StartIoQ will overflow; assertions should check
that this never happens.


The following sections describe the cycle behavior of the cache interface as it performs several different tasks. 


5.2.11.6 Cycle Behavior: TaskStart to CmdAddr Bus 


This table describes how a memory transfer request enters the DmaCif through the TaskStart interface and
eventually gets driven onto the CSW CmdAddr bus. The cycle numbers start with C5 because that’s the stage in
the microengine pipeline at which the Task is sent to the cache interface.

Read TaskStart: The TaskStart interface decodes the ue_cif_TaskStart signals, decides which queue the task will go
into, and prepares to write one of the queues. The inputs to the queues and the write enables are flopped into C6
registers.

Drive CmdAddrReq: Data and control signals for the queues are C6 flops. Data goes into the selected queue module.
Each queue also generates an output in C6 (performing bypass if necessary) so that the queue select logic can peek
at the head of each queue and decide whether it can issue or not. For example, BRDs cannot issue unless an ORT
slot is available. One queue is selected (if any is eligible), and CmdAddrReq is asserted. The TID is provided by the
ORT or OWT, which announces the next available slot.

CmdAddrGnt arrives: When CmdAddrGnt arrives, ask the ORT or OWT to fill a slot. The ORT/OWT slot that is
filled corresponds to the TID that was driven onto dma_csw_CmdAddrTID in C6.

ORT/OWT changes: Changes to ORT/OWT appear in C8.

FIXME: Secondary queues

FIXME: ort and owt write

FIXME: command completion, update counters


BELOW IS THE ORIGINAL PIPELINE. PULL ANY USEFUL STUFF OUT, THEN REMOVE IT.

[Stage labels that survive for the original pipeline: Arbitrate CmdAddr, Arbitrate Data.]

Examine output of each queue and the empty flags. Decide which queue to service next (WriteQ, WriteExtendQ,
ReadQ, ReadExtendQ). Only choose from a read queue if there is an empty slot in the OutstandingReadTable.
Choose odd/even direction and assert CmdAddrReq.

Drive the rest of the CmdAddr wires. CmdAddrGnt returns true or false. If true, assert DataReq if needed (reads
don’t need it) and continue through the pipeline as usual. If false, stall C0 and C1, and continue to assert
CmdAddrReq and drive CmdAddr until it is granted once. Meanwhile, begin to read DW01 from DMA internal
memory, so that if all goes well we can drive it in C2.

NOTE: Arbitration failure in C2 can cause C1 to stall; in this case we must be sure not to bid for CmdAddr after
winning it once.

DW01 means doublewords zero and one. At start of C2, the DW01 read is completed, ECC bits are generated, and
DW01 data is driven to the cache switch.

Later in C2, DataGnt returns true or false. If true, continue through the pipeline as usual and start the read of
DW23 so that


5.2.11.7 Memory to DMA Pipeline


Cycle | Stage | Description
[illegible] | [illegible] | Response arrives on incoming Data bus.
C1 | ECC, Dispatch | Check ECC on incoming data and correct single bit errors. Use transaction number as index into OutstandingReadTable, figure out where this data should be written. Prepare to write data to DMA memory.
C2 | Write | Write to DMA internal memory at the appropriate location. Clear this slot in the OutstandingReadTable.


5.2.11.8 I/O Access Pipeline (Read and Write) 


I/O read arrives on incoming CmdAddr bus.


Start Uinst: Drive the StartIo interface to the microengine
to trigger an IOREAD microinstruction. The I/O read address
is sent on cif_ue_StartIoAddr.


Store Request: Store some of the CmdAddr parameters
into the IoAccess slot for the requesting core.
There are 6 IoAccess slots for the 6 cores. If the request
is a read, continue through this pipeline. If the request is
a write, disable the rest of this pipeline.
There is nothing more to do until the Data
arrives, one or more cycles later. See I/O
Write pipeline below.


C3-C7? microengine pipeline: IOREAD instruction travels through the
microengine pipeline. The result is written to
registers (per core) in the copy port, then
the microengine requests a memory transfer
from the copy port address back to the
core. The transfer is recorded in WriteQ
and processed by the DMA to Memory
pipeline, above.


5.2.11.9 I/O Write Pipeline


I/O write arrival: I/O write data arrives on incoming Data
bus.


Read IoAccess: Use the target core number to index into
IoAccess and retrieve the CmdAddr portion
of the I/O write transaction. Now we
have enough information to begin. Prepare
to write the data to a register (per core) in
the copy port.


Write Copy, Start Uinst: On rising edge of C2, data appears in
copy port. Drive the StartIo interface
to trigger an IOWRITE microinstruction.
The I/O write address is sent on
cif_ue_StartIoAddr.

microengine pipeline: The IOWRITE instruction travels through
the microengine pipeline. The instruction
reads from the register in the copy port,
and the result is written to the address
specified by the I/O write request. Then
the microengine requests a memory transfer
back to the core with a special flag to
mark it as an I/O write response. The
transfer is enqueued in WriteQ, then
enters the DMA to Memory pipeline, above.


5.2.11.10 Task interface pipeline


TaskStart arrival: TaskStart signal arrives from microengine.
Prepare to read MemAddr memory using
TaskThread as the address.


C1 Enqueue: Prepare to write memory address and
length of transfer into either the Wrq or the
Rdq. Compute new values of NumPending
registers.


C2 Report: Send Pending/Full status for each thread
back to microengine.


5.2.12 Microengine Programming 
5.2.12.1 Instructions 


The microengine instructions contain the following fields. The microinstruction memory contains DMA_UIM_WORDS 
words (presently 1024 as of 10/27/05). 
Control Field Bits


3 
Sleep index 




The control store needs to be accessible via JTAG; any other path is simply convenience. The DMA engine 
should be held in reset state (no requests allowed out from cache or switch interfaces) while the control store is 
being written. 




5.2.12.2 Operand selection 


Microinstructions need the ability to access certain state variables by special addressing functions. Each of these 
values must be set up in the thread state before the corresponding variables can be accessed. 

The current packet buffer is identified by both the port being serviced and the specific packet to or from that 
port, which is selected by hardware on a FIFO basis. 


5.2.12.3 Destination Selection 


TBD 


5.2.12.4 ALU operations 


In addition to the typical add, subtract, and boolean ALU operations, I imagine some unusual ops, combining 
two or more “operations” in one opcode because the data required by those operations are all available concurrently. 
These include calculating address and remaining length in a buffer, queue access, and heap access length checks: 


Priority Encode Priority encode looks at operand A bits 31:0 to find the least significant bit that is 1. The 
result equals the bit number of the least significant bit that is 1. If no bits are set in A<31:0>, the result is zero. 
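
As a rough software model of Priority Encode (a sketch only; the function name is ours and is not a microcode mnemonic), the operation might be written in Python as follows. Note that a result of zero is ambiguous: it is produced both when bit 0 is set and when no bit is set.

```python
def priority_encode(a: int) -> int:
    """Index of the least significant 1 in A<31:0>; zero if no bit is set."""
    low = a & 0xFFFFFFFF                   # only bits 31:0 are examined
    if low == 0:
        return 0                           # per the text, "no bits set" also yields zero
    return (low & -low).bit_length() - 1   # isolate the lowest set bit, take its index
```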


PID Match The ALU compares a 16-bit value taken from bits 31:16 of the packet trailer with a 16-bit field from 
the Control/Status register file. [?? how to combine comparison test with type dispatch??] 


Pointer Update The ALU A operand is a 64-bit value with a physical address in bits 35:0 and a (negative) 
buffer length in bits 63:36. The B operand is a 28-bit payload length value. The ALU adds the payload length 
to the address, and adds the payload length to the negative buffer length. The address portion of the A operand 
(not the sum) is available to the memory address register; the ALU output is available to be written back to the 
data memory. Branch functions will report whether the buffer length has become positive. This function is used 
for DMA buffer pointers and queue access. 

ALU<35:0> = A<35:0> + B<27:0> 

ALU<63:36> = A<63:36> + B<27:0> 

Address = A<35:0> 
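
The following Python sketch models Pointer Update under two assumptions that the text does not state: the two fields are added independently (no carry from bits 35:0 into bits 63:36), and "positive" means strictly greater than zero. The helper pack_pointer is hypothetical and exists only to build test operands.

```python
ADDR_BITS, LEN_BITS = 36, 28
ADDR_MASK = (1 << ADDR_BITS) - 1
LEN_MASK = (1 << LEN_BITS) - 1

def pack_pointer(addr: int, length: int) -> int:
    """Hypothetical helper: address in 35:0, negative length (two's complement) in 63:36."""
    return ((-length & LEN_MASK) << ADDR_BITS) | (addr & ADDR_MASK)

def pointer_update(a: int, b: int):
    """Return (alu, address, length_positive) for the Pointer Update operation."""
    payload = b & LEN_MASK
    address = a & ADDR_MASK                              # Address = A<35:0>, not the sum
    new_addr = (address + payload) & ADDR_MASK           # ALU<35:0>
    new_len = ((a >> ADDR_BITS) + payload) & LEN_MASK    # ALU<63:36>
    alu = (new_len << ADDR_BITS) | new_addr
    # Interpret the 28-bit length field as a signed value for the branch test.
    signed_len = new_len - (1 << LEN_BITS) if new_len >> (LEN_BITS - 1) else new_len
    return alu, address, signed_len > 0
```

Starting from a 256-byte buffer, two 128-byte updates bring the length field to exactly zero without reporting positive; any further update reports positive, meaning the buffer has been overrun.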


Pointer Distance The operands are pointers, with addresses in bits 35:0. The result has the low 28 bits of their 
difference in bits 63:36, and the unmodified A operand address in 35:0. 


Pointer Extend The result is zero in 63:36, and the difference A<35:0> - B<63:36>, sign extended, in 35:0.
This is used for calculating the end of a buffer region.


Offset The A operand is a buffer pointer as in Pointer Update. The B operand is a 28-bit offset in bits 27:0. The 
ALU adds the B operand to the buffer address, making the sum available to the memory address register. The B 
operand is compared against the buffer length in bits 63:36 of the A operand. Branch functions will report if the 
B operand is greater than the buffer length. This function is used for calculating a heap address and checking that 
the offset is in range. 
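
A minimal Python model of Offset, assuming the buffer length is stored as a negative two's complement value in bits 63:36 (as in Pointer Update) and is negated before the comparison. The function name and the returned tuple are illustrative only.

```python
ADDR_BITS, LEN_BITS = 36, 28
ADDR_MASK = (1 << ADDR_BITS) - 1
LEN_MASK = (1 << LEN_BITS) - 1

def offset_op(a: int, b: int):
    """Return (address, out_of_range) for the Offset operation."""
    offset = b & LEN_MASK                        # 28-bit offset in B<27:0>
    base = a & ADDR_MASK                         # buffer address in A<35:0>
    stored = (a >> ADDR_BITS) & LEN_MASK         # negative length field in A<63:36>
    length = (-stored) & LEN_MASK                # recover the positive buffer length
    address = (base + offset) & ADDR_MASK        # sum goes to the memory address register
    return address, offset > length              # branch condition: offset exceeds length
```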


Swap Offset Like Offset, except that the B operand is in bits 59:32.


Swap Halves A operand bits 31:0 become result 63:32, and B operand bits 63:32 become result 31:0.


Munge The B operand is rotated, masked, and XORed according to bits of the A operand. A<5:0> encode
a right rotation of the B operand. Bits A<39:8> are ANDed with bits <31:0> of the rotated value, and bits
A<63:40> are XORed with bits <23:0> of the rotated and ANDed result. The boolean masks are msb extended
with bits 39 and 63, respectively.
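
One possible reading of Munge, sketched in Python. It assumes a 64-bit rotation and that "msb extended" means each mask is sign-extended to 64 bits from its top bit (A<39> for the AND mask, A<63> for the XOR mask); neither detail is spelled out above, so treat this as an illustration rather than the definition.

```python
M64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Extend a 'bits'-wide field to 64 bits by replicating its top bit."""
    if (value >> (bits - 1)) & 1:
        return value | (M64 & ~((1 << bits) - 1))
    return value

def munge(a: int, b: int) -> int:
    rot = a & 0x3F                                        # A<5:0>: right rotation count
    b &= M64
    rotated = ((b >> rot) | (b << (64 - rot))) & M64
    and_mask = sign_extend((a >> 8) & 0xFFFFFFFF, 32)     # A<39:8>
    xor_mask = sign_extend((a >> 40) & 0xFFFFFF, 24)      # A<63:40>
    return (rotated & and_mask) ^ xor_mask
```

For example, an A operand of 8 | (0xFF << 8) extracts the second byte of B: rotate right by 8, keep the low 8 bits, XOR with zero.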


Merge0, Merge1, Merge2, Merge3


A merge operation combines operand A and operand B in a programmable way. When loading the microcode, the
R_SDmaMergeOpHi/Lo registers are initialized with values that control the behavior of the MergeN instructions.
When microcode executes a MergeN instruction, bits from operand A and operand B are combined according to
the values in R_SDmaMergeOpHi/Lo. A 1 in the register causes the corresponding bit to be selected from
operand B, while a 0 selects from operand A.


The merge is implemented as follows, where X is the N of the MergeN instruction:


For bit from 63 to 32,

Result[bit] = R_SDmaMergeOpHi[X][bit-32] ? opb[bit] : opa[bit]

For bit from 31 to 0,

Result[bit] = R_SDmaMergeOpLo[X][bit] ? opb[bit] : opa[bit]
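
The per-bit selection above is equivalent to a single 64-bit mask operation. A short Python sketch, in which op_hi and op_lo stand for the 32-bit contents of the Hi and Lo merge registers selected by the MergeN instruction (the parameter names are ours):

```python
M64 = (1 << 64) - 1

def merge(opa: int, opb: int, op_hi: int, op_lo: int) -> int:
    """Select each result bit from opb where the merge register bit is 1, else from opa."""
    select = ((op_hi & 0xFFFFFFFF) << 32) | (op_lo & 0xFFFFFFFF)
    return ((opb & select) | (opa & ~select)) & M64
```

With op_hi all ones and op_lo zero, the result takes its upper half from operand B and its lower half from operand A.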


5.2.12.5 Sleep Functions


These functions put a thread to sleep (getting no datapath cycles) until a specified event occurs:


Memory transfer completion Wait until a memory transfer has finished and any associated resources can be
reused. For writes, this means that the data has been written to the cache switch. For reads, it means that the
data is available for the next instruction to use.


Packet buffer available After this instruction, wait until there is a packet buffer available for this thread. This 
is only defined for txN/rxN/copy port threads which are associated with a port. A receive thread would awaken 
when a new packet is received from the fabric switch. A transmit thread would awaken when a packet buffer is 
empty and ready to be built.


Command Arrival After this instruction, wait until a new command arrives from a processor. This would be 
used in the queue manager, which should not waste cycles polling. 


Sleep Forever This instruction causes a thread to sleep indefinitely. 


Take Mutex Mutexes are provided so that microengine threads can safely access shared resources such as queues
and contexts. A typical scenario is that several threads need to write to an event queue in memory. If all the threads 
read the queue pointer, write to memory, and write the queue pointer in parallel, then events would get overwritten 
or lost. Instead, each thread obtains a mutex for the queue, which guarantees exclusive access to the queue pointer 
and the queue memory. Then the thread reads the queue pointer, writes to memory, updates the pointer, and 
releases the mutex so that another thread can have its turn. 

The Take Mutex function causes the thread to sleep until it owns the mutex identified in the Sleep Index field. 
The following instruction is allowed to read/write the shared resource, with assurance that no other user of the 
same mutex is in a critical section. If the mutex is already available, Take Mutex allows the thread to execute again 
with only the usual delay. 


Drop Mutex The “Drop Mutex” function releases a mutex to make it available for other threads. The hardware 
guarantees that no more than one thread will have ownership of the mutex at any time. The instruction which 
specifies Drop Mutex is allowed to read/write the shared resource, but any subsequent instructions in the thread 
must not. 


5.2.12.6 Stall 


The 3-bit stall field encodes the number of cycles that the dma engine must wait before the current thread is 
permitted to bid for next use of the datapath. Typically, this field defaults to 1, which ensures that alu results and 
branch conditions from the current instruction are available for the next. It must be greater than 1 in instructions 
which issue a memory request or release a packet buffer and wait for it (exact value TBD); it may be zero in 
instructions whose successor does not depend on any result of the current instruction. 


5.2.12.7 Memory Transfer 


Microcode needs to be able to initiate and sometimes wait for completion of memory transactions, for packet 
payloads, queue entries, and buffer and route descriptors. Microcode should be able to specify reads and writes of 
up to 128 consecutive bytes; reads should be aligned to 8-byte boundaries, writes to 32-byte boundaries. I want to 
be able to initiate transactions of up to 128 bytes all together, rather than waiting for completion of one 32-byte 
cache block before starting the next; this may prove complex, and probably results in a different L2 interface for 
the DMA engine than that used by the processors. Reads and writes of the packet buffers may start at the second 
or third word of the packet, and are governed by the packet length register. 


5.2.12.8 Branch Functions 


Some of the branch functions can be arranged to evaluate a small number of bits at the operand register; others
must test the alu output. This is important because it determines the branch latency and thus the microinstruction
rate of each thread. If we assume that operand access takes one cycle and microinstruction access takes another, 
that means we can execute instructions from a given thread every second or third cycle. 


• Queue pointer test: is queue_pointer + entry_length > queue_limit? (ALU N)
• Buffer descriptor test: is packet_length > buffer_remaining? (ALU N)
• Buffer valid test: is buffer descriptor valid? (combined with above; ALU Z?)
• Type dispatch: PID-Match, Trailer<11:8>
• Queue entry dispatch: decode queue entry
• Port select decode from RDT entry
• Segment length check: ALU<63:32> < 0 (ALU N)
• Context match: Aop<31:16> = Bop<31:16> (ALU Z)
• Sequence number match: Aop<63:32> = Bop<63:32> (ALU Z)
• Queue empty? full (room for more)? (ALU N)
• Completion notification?
• Error detected?


I do not presently see a need for subroutines, especially if we have a general case dispatch for port select and queue 
entry decode. 

There’s a question how a thread sleeps when it’s waiting for memory or an available packet buffer or a mutex 
with another thread. 


5.2.12.9 Next Address 


In the absence of a branch function, each microinstruction address is the contents of the NextAddr field of 
the previous microinstruction of the same thread. Branch functions substitute conditional values for some of the 
low-order NextAddr bits. 
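
A sketch of the next-address selection in Python. The 1024-word address space comes from the microinstruction memory size given earlier; how many low-order bits a branch function replaces is not specified here, so branch_width is a hypothetical parameter.

```python
UIM_WORDS = 1024   # microinstruction memory size quoted in 5.2.12.1

def next_address(next_addr: int, branch_bits: int = 0, branch_width: int = 0) -> int:
    """NextAddr with its low 'branch_width' bits replaced by branch condition bits."""
    mask = (1 << branch_width) - 1          # zero when there is no branch function
    return ((next_addr & ~mask) | (branch_bits & mask)) % UIM_WORDS
```

With no branch function the NextAddr field is used unchanged; a two-bit dispatch on NextAddr 0x120 can reach 0x120 through 0x123.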


5.2.13 Unified Engine 


There are several splits one could imagine in the DMA Engine; I have chosen a unified approach because I 
wanted to build sufficient capacity to saturate the pin bandwidth at the memory and switch interfaces, and I 
wanted that capacity to be available to any thread that needed it. I am hopeful that the transmit side and the 
receive side can each be implemented as a pipelined microengine with four threads serviced on a demand round- 
robin basis. Branching instructions (virtually all) would probably require a latency of two cycles: (cycle 1) Branch 
address computation, microinstruction fetch; (cycle 2) read operands; (cycle 3) run alu, write result. Bypass is not 
necessary, except perhaps for global variables, because branch prohibits executing the same thread next cycle. 


5.2.14 Bandwidth 


The available main memory bandwidth, assuming two ports of 400 MHz DDR2 memories 8 bytes wide, is 6.4 
GB/sec, rising to 12.8 as DDR data rates rise to 800 MHz. The L2 cache can deliver 16 GB/sec on hits, but every 
miss costs two L2 cycles. As a result, with full memory bandwidth demand there is only 3.2 GB/sec available for 
hits, so the available bandwidth drops to 9.6 GB/sec when the memory is busy. [I’d really like to improve the 
L2 bandwidth, if we can do that cheaply. Easiest change seems to be to bury writes under reads of the opposite 
half-line.] 

The possible demand from the switch consists of 6 ports at 2 GB/sec, derated by the 8/10 code, resulting in 
10 GB/sec; but realistic load conditions further derate that demand by factors of 1/5.5 (to account for average 
path length) and 88% (payload fraction of packet size), to about 1.6 GB/sec with uniformly distributed traffic. For 
communication between two nodes on an otherwise idle fabric, we should be able to sustain 4.2 GB/sec of payload 
delivery. 
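
The figures in this section can be reproduced with a few lines of arithmetic. The Python sketch below assumes that "every miss costs two L2 cycles" means memory traffic consumes twice its own bandwidth from the L2, and that the two-node case uses three links; both are our readings, chosen because they reproduce the quoted numbers.

```python
# Memory side, in GB/sec
mem_bw = 2 * 8 * 0.4                 # two ports x 8 bytes x 400 MHz = 6.4
l2_hit_bw = 16.0                     # L2 delivery rate on hits
l2_used_by_misses = 2 * mem_bw       # each miss occupies two L2 cycles
hits_left = l2_hit_bw - l2_used_by_misses    # 3.2 left for hits
busy_total = mem_bw + hits_left              # 9.6 when memory is busy

# Switch side, in GB/sec
link_bw = 6 * 2.0 * 8 / 10           # 9.6 after 8/10 coding; the text rounds to 10
uniform_payload = link_bw / 5.5 * 0.88       # about 1.5 (1.6 if starting from 10)
p2p_payload = 3 * 2.0 * 8 / 10 * 0.88        # about 4.2, assuming three links
```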

In the point-to-point DMA case, we will have three input or output ports running at full bandwidth, each
transferring a packet every 95 ns, more or less. This means that in a fully-unified DMA engine no stage in the
pipeline should dedicate more than 32 ns, or 8 cycles, to one packet; that means the payload cannot be copied and
must be transferred directly between the switch buffer and the cache.


5.2.15 Matching 


Content-addressable memory is expensive in power and area, so we’d prefer not to keep large numbers of receive 
contexts ready to match each incoming packet. Still, we do need to be able to keep several contexts open at a 
time; the compromise is to build the equivalent of a direct-mapped cache. A few bits of the context-id are used to 
index the receive context space, and the remaining bits compared with the id stored at that location. If there is no 
match, the packet is assumed to be a remnant of an aborted transfer and will be discarded; we’ll count such events. 
Similarly, each received packet carries a 16-bit process id and a 4-bit process index in the trailer; the process index 
is used to address the memory which holds control/status pages, and the selected page is checked to match the 
process id, which must match to use the process variables. Process index 15 is reserved for non process-specific 
packets, and indexes 12-14 are reserved for global variables. 
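
The direct-mapped lookup can be sketched in Python as follows. The number of index bits (3) and the choice of the low-order bits as the index are assumptions made for illustration; the class and method names are ours.

```python
class ReceiveContextTable:
    """Direct-mapped table of open receive contexts, indexed by part of the context id."""

    def __init__(self, index_bits: int = 3):
        self.index_bits = index_bits
        self.slots = [None] * (1 << index_bits)   # each slot holds (tag, state) or None
        self.dropped = 0                          # count of packets with no match

    def _split(self, context_id: int):
        index = context_id & ((1 << self.index_bits) - 1)
        tag = context_id >> self.index_bits
        return index, tag

    def open(self, context_id: int, state) -> None:
        index, tag = self._split(context_id)
        self.slots[index] = (tag, state)          # replaces whatever was there

    def match(self, context_id: int):
        index, tag = self._split(context_id)
        entry = self.slots[index]
        if entry is None or entry[0] != tag:
            self.dropped += 1                     # remnant of an aborted transfer
            return None
        return entry[1]
```

Two context ids that share the same index bits cannot be open at once; opening the second evicts the first, which is the cost of avoiding a content-addressable memory.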


5.2.16 Interface registers


The DMA Engine needs controls for both user and kernel threads, and we need to protect some of the controls 
of the user space from access by user mode. Still, we can use a pair of pages for each processor, one user-writable 
and the other only writable by the kernel to define the required variables for both threads. 

It’s also important to think carefully about which information needs to be in interface registers, which are 
presumably uncachable, and which should be in cacheable memory, where we can use more efficient cache-block 
transfers. [We could implement the control/status registers in such a way that they were cacheable; we’d just have 
to remember ownership, and invalidate the owner whenever one changed.]


5.2.17 Coherence 


We clearly need a coherent model of the interface between software and the DMA Engine control interface. 
We also need to ensure that accesses by concurrent segments see a coherent memory interface, whether tested by 
streams from different remote nodes or by overlapping stride patterns. 

I am hoping that we can achieve coherence with the processors by means of a simple interlock which a thread 
grabs when it reads memory for modification, and releases upon write. The interlock should delay intervention 
responses, and if we make its use exclusive among threads, it will ensure coherence among threads. I still need to 
explain how a thread waits for the lock to become available; can we inhibit its bid for cycles? 


5.2.18 Alignment 


For efficient operation of the DMA engine, both transmit and receive buffers should be aligned on 8-byte 
boundaries and transferred data sizes should be multiples of 8. There are at least three distinct performance levels: 
highest performance is achieved using contiguous data in large buffers aligned to 64-byte boundaries; the DMA 
engine can achieve intermediate performance with efficient transfers of data in multiples of 8 bytes, aligned to 
8-byte boundaries. Transfers that do not meet these criteria must be handled by software, and suffer significantly 
higher penalties. 


5.2.19 Strides and Scatter/Gather 


For aligned contiguous transfers, the DMA engine has a substantial performance advantage over the processor 
cores, in that it can copy entire cache blocks at a rate of one per cycle (8 GB/sec). This advantage disappears 
in strided or scatter/gather operations, where each packet may require multiple memory references and must be 
assembled and disassembled piecemeal. With a reasonable datapath width (say, 8 bytes), the peak transfer rate falls 
to 2 GB/sec, the same as the peak copy rate of a 5Kf core. The DMA engine still has the potential of substantially 
reduced overhead, because of having been designed specifically for the purpose, but that shows up as overhead 
hardware (parallel 32-bit adders, for example). A software implementation, conversely, would have the option of 
defining special cases to eliminate some of the overhead. 

If we get rid of sub-block access in the DMA engine, we’ll also force Enq_* and Wr_Heap to cache block sizes. 


5.2.20 Output Thread 


There is an output thread associated with each output port of the switch, and one with the copy function. Each 
such thread has at least three output buffers, and the thread works on setting up one, while the cache is filling 
the second, while the third is being emptied by the switch. When it finishes setting up the cache requests to fill a 
buffer, the thread waits for availability of the next buffer. If there is a ready transmit context for this output port, 
it builds a packet in the buffer, enables it for output, and returns to the top. See the pseudocode (??) for a more 
detailed flow. 

The first microinstruction of an output thread is executed with process variables and transmit context selected. 
It copies the routing information from the transmit context to the packet header. The second writes the Packet 
Type and Process ID to the trailer word. It tests whether the context is for a DMA packet, and if not, whether the
payload comes from the queue entry or the heap. 

If the payload is from the queue entry, the third microinstruction loops copying payload to the packet buffer 
(or loads it from queue memory). 

If the payload is from the heap, the third microinstruction checks the heap offset and length, and the fourth 
reads from the heap to the packet buffer. 

If the packet is DMA, the third microinstruction checks the BDT entry; the fourth initiates a memory read,
updates address and length, and checks the BDT again; the fifth initiates a second read and checks the segment 
length. If the segment is done, the sixth microinstruction pops a new transmit context from the appropriate queue. 


5.2.21 Input Thread 


There is an input thread associated with each input port of the switch. Each such thread rotates among three 
input buffers, making one available to the switch as it interprets the control information in the next, and the cache 
stores the payload of the third. Upon completing a packet buffer, it waits for the next to be full, then finds the 
receive context that matches this packet. Finding one, it sets up the cache requests to store the payload according 
to the context, writes any event queue entry required, writes any response queue entry required, and returns to 
the top. If there is no matching receive context, the input packet is discarded. See the pseudocode (??) for a more 
detailed flow. 

For each received packet, the first microinstruction executed is selected by the packet type field, and the hardware 
uses the 4-bit process index to select one of the control/status pages to control heap and queue accesses. In all 
cases, the first instruction checks that the PID matches that in the selected C/S page. 

The first cycle of an Enq_Tx, _Rx, or _Direct tests the queue pointers for the appropriate queue to make sure
there is room for the new packet. The next cycle either writes the packet to the queue (one or two writes required) 
while updating the pointers; wraps the queue pointer if needed, then writes the packet; or sets an error indication 
because the queue has overflowed. 

The first cycle of a Wr_Heap tests the offset and length against the heap size. The second cycle either writes 
the payload or sets an error indication because the store is out of bounds. 

The first cycle of a DMA selects the receive context and checks for a match; it checks for a packet sequence 
number match, and it checks that the current BDT entry has room for the first cache line of the packet. The 
second cycle writes the first cache block, increments the address, decrements the buffer length, and checks that the 
BDT entry has room for the second cache block. The third cycle writes the second block, increments the address, 
decrements the length, and checks whether the message segment is complete, and whether a notification is needed. 
If it is, the fourth cycle pushes the Ack onto the transmit foreground queue. In either case, upon completion of a 
segment, a new receive context is popped off the receive queue. 


5.2.22 Thread performance 


There are tight performance constraints on packet processing: in point-to-point communication, a full-sized 
DMA packet may arrive every 95 ns on each input port (152 bytes * 10 bits/byte / 16 bits/ns). Assuming a 250 
MHz (4ns) clock in the DMA engine, we have 24 cycles in which to service 3 packets, or 8 cycles per packet. To 
achieve this goal, we’ll need to be very efficient in our use of cycles, especially in dispatching to the appropriate 
routine for each case. 
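
The cycle budget follows directly from the numbers above; a small Python check, using only the values quoted in this section (the variable names are ours):

```python
PACKET_BYTES = 152        # full-sized DMA packet
LINK_BITS_PER_NS = 16     # link rate after 10 bits/byte encoding is applied
CLOCK_NS = 4              # 250 MHz DMA engine clock
PORTS = 3                 # input ports active in the point-to-point case

packet_ns = PACKET_BYTES * 10 / LINK_BITS_PER_NS      # 95.0 ns between packets per port
cycles_per_interval = round(packet_ns / CLOCK_NS)     # 23.75, taken as 24 cycles
cycles_per_packet = cycles_per_interval // PORTS      # 8 cycles per packet
```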

The first microinstruction can be selected by the hardware on the basis of packet type and validity. Subsequent 
microinstructions can be pipelined to select register file operands, alu operations, and branch decision before fetching 
the next microinstruction of the same thread. 

Short-message latency imposes an additional constraint: of 500ns total, the fabric requires about 180, leaving 
320ns for DMA Engine and library software. We want to make sure that when the Tx command queue is empty, 
new entries are passed directly to the output thread without diverting through memory, and received packets are 
available on the event queue with absolutely minimal overhead. 


5.2.23 Queue manager 


The queue manager state machine is activated whenever a new entry is stored in the command registers or an 
input thread has something to enqueue for response. It checks the queue entry for validity (allowed buffer and 
route descriptors) and copies it to appropriate memory for access by the input and output threads. A related state 
machine writes event queue entries. Figure 5.6 is a schematic representation of the queue manager process for
transmit and receive queues. 

When a transmit context is completed, the queue manager pops the next item off the transmit queue associated 
with the same output port. When a receive context is completed, the queue manager pops the next item off the 
receive queue associated with the input port, and assigns a new context id. It inserts the new context id into a 
Get request (the rest of which was in the receive queue entry), and passes the Get request to the bypass transmit 
context of the output port selected by the route. 


Figure 5.6: Queue manager


5.2.24 Port manager


Each input and output port has a state machine which manages three or more packet buffers, such that one 
of them is assigned to the switch port and is used for injecting a packet into the fabric or receiving one from the 
fabric; one of the packet buffers is assigned to the cache, and sequences the transfer of up to two aligned L2 cache 
lines to or from the cache, using an address provided by the DMA engine; the third packet buffer is assigned to the 
port’s input or output thread in the DMA engine, which can read or write it under microcode control. The roles 
assigned to the three buffers rotate when all have completed their respective tasks, or have nothing to do. [It might
pay to have four packet buffers per port, so that variations in processing time can be absorbed without degrading
performance.]
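
The role rotation can be modeled with a three-entry ring. The Python sketch below assumes an output port, where a buffer moves from the thread to the cache to the switch; an input port would rotate in the opposite order. Class and method names are illustrative.

```python
from collections import deque

ROLES = ("thread", "cache", "switch")   # order a buffer moves through on an output port

class PortBuffers:
    def __init__(self):
        self.order = deque([0, 1, 2])             # order[i] is the buffer holding ROLES[i]
        self.done = dict.fromkeys(ROLES, False)

    def role_of(self, buf: int) -> str:
        return ROLES[self.order.index(buf)]

    def finish(self, role: str) -> bool:
        """Mark one role's task complete (or idle); rotate once all three are done."""
        self.done[role] = True
        if not all(self.done.values()):
            return False
        self.order.rotate(1)                      # every buffer advances to the next role
        self.done = dict.fromkeys(ROLES, False)
        return True
```

No buffer changes role until the slowest of the three tasks has finished, which is why a fourth buffer could help absorb variations in processing time.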


5.2.25 Copy Thread 


There is also a low-priority thread which performs memory-to-memory transfers; it appears that the simplest 
implementation treats the copy thread as an additional input and output port which software can treat as if it 
were simply another interface to the fabric with a loopback destination. I’d like to augment the copy function with 
a useful crypto function (for encryption/decryption of TCP/IP traffic) and a zero-memory function (for use by 
the page-creation software). And if the fabric processor is going to be responsible for strided and scatter/gather
operations, it would be helpful if the copy thread could be invoked to prefetch memory along strided or scattered 
streams. 


5.2.26 Timeouts 


We need to be able to detect lost packets or broken links without incurring significant software overhead. One 
way to do that would be to set a timer on each active receive context, and complain about any that exceeded a 
software-defined maximum value, either since initialization, or since the last received packet. Transmit contexts 
may not benefit, though it would be desirable to invoke software if unable to emit a packet over some defined time 
period. 

Timers on receive contexts won’t detect lost Acks or Rendezvous requests, nor anything which doesn’t have a 
dedicated receive context. 


5.2.27 Error Conditions 


• Correctable ECC error on memory access: correct the error, capture address and syndrome in error registers.
• Uncorrectable error on memory access: disable the thread with the error, capture address, interrupt host.
• Receive packet with “poison” type: count and drop packet
• Receive packet with process id mismatch: enqueue for fabric processor?
• Receive DMA packet with context id mismatch: count and drop packet
• Event queue overflow: Interrupt
• Any command queue overflow: Interrupt
• Route or Buffer handle out of bounds: Interrupt
• Invalid Buffer Descriptor: Interrupt, push current context onto event queue, invalidate context.


5.3 Notes 


5.3.1 Rendezvous 


Rendezvous performance is pretty important, because any latency becomes message overhead, so we'd like it to 
be handled in hardware if at all possible. The rendezvous response, when it returns, should start the dma transfer, 
including setting up any shifting necessary to get the packets aligned to cache blocks at the destination, and setting 
the segment packet count to stop at a page boundary. 

The communicator data structure can include an array indexed by rank, where each item in the array is the 
head of an ordered list of posted receives, and there can be a separate list of receives for rank-any. The structure 
representing a posted receive could include a sequence number, so we could determine which is oldest. But are
we then going to hash on the tag? 

Rendezvous is also responsible for calculating the alignment of segments to ensure that packet payloads are 
aligned to cache blocks at the receiver. [The transmitter is in a much better position than the receiver to do the 
alignment, because the packets may arrive at the receiver from multiple interleaved streams. The receiver would 
also have to do read-modify-writes if it got partial lines.] 

Therefore, the rendezvous request includes the low bits of the source buffer starting address and the buffer 
length. The rendezvous response carries (in the datatype field) the number of bytes by which to shift source cache 
blocks to align them with destination cache blocks, and the length of the segment. I intend the rendezvous response 
to be coded as a Get: the relevant parameters are in the transmit context to initiate the transfer. 

DMA transfers always reserve a receive context before queueing for use of a transmit context; this prevents a 
potential deadlock which could occur if some transfers reserved transmit contexts first, then queued waiting for 
receive contexts. 


5.3.2 Ethernet simulation 


We'd like to be able to pretend that the fabric is one large ethernet, so as to use TCP and UDP services with 
a minimum of new development. For that purpose, there should be a driver with the same API as the ethernet 
driver, but which converts MAC addresses to routes through the fabric or broadcast. 


5.3.3 Barrier 


When any communicator is created, the data structure in each node includes space reserved for barriers and 
collective operations on that communicator. The nodes which participate in the communicator are partitioned into 
a tree or multidimensional network which will be used for barriers and collectives. The data structure in each node 
describes where this node is in that network. Specifically, how many inputs are required at this point in the network 
to complete a barrier, and where to send notification of barrier stage completion. 

If we have explosive broadcast, I expect that a collection tree followed by broadcast notification will be the most 
efficient implementation of barriers. Without it, we may find that a single-pass multidimensional exchange works 
better, in spite of needing more messages. 

In the two-pass tree-structured implementation, most nodes are leaves, some are intermediate, and one 
(arbitrarily chosen, from the perspective of the MPI user) is the root. Leaf nodes, upon encountering a barrier, send a 
packet containing the communicator id to their designated intermediaries, which mark receipt from each leaf and 
the local process in the communicator data structure. Upon receipt of the last notice, an intermediate node sends 
notice to its designated superior, in just the way that each leaf did to the intermediate node. The root, rather 
than sending notice to a superior, responds with a broadcast which notifies all ranks of the communicator that the 
barrier has been passed. Intermediate nodes must then reset the communicator data structure in preparation for 
the next barrier. [This seems to imply a race. Maybe barriers and collectives should carry a generation number.] 


If we use a single-pass implementation, there is no distinction between leaf, intermediate, and root nodes. The 
communicator is factored by some small integer radix r, and each node exchanges messages with r-1 other nodes at 
each stage of the barrier process. The barrier is complete after k stages, where r^k ≥ N, the size of the communicator. 
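
The single-pass scheme can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses a dissemination-style pattern in which rank i notifies ranks i + j*r^s (mod N) at stage s, which is one possible way to realize a radix-r exchange and is not necessarily the pattern intended here. The function names are hypothetical.

```python
def barrier_stages(n, r):
    """Smallest k such that r**k >= n (number of exchange stages)."""
    k, reach = 0, 1
    while reach < n:
        reach *= r
        k += 1
    return k


def partners(rank, n, r, stage):
    """Ranks notified by `rank` at a 0-based stage (assumed pattern)."""
    stride = r ** stage
    return [(rank + j * stride) % n for j in range(1, r)]


def simulate(n, r):
    """Return, for each rank, the set of ranks known to have arrived."""
    known = [{i} for i in range(n)]
    for stage in range(barrier_stages(n, r)):
        # Messages in one stage carry only what was known before the stage.
        snapshot = [set(s) for s in known]
        for rank in range(n):
            for dest in partners(rank, n, r, stage):
                known[dest] |= snapshot[rank]
    return known
```

With this pattern every rank learns of all N arrivals after k stages, at the cost of N*(r-1) messages per stage.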


5.3.4 Cache interface 


If we provided a single 8-byte wide interface to the L2 cache, operating at 250 MHz, the peak achievable 
bandwidth would be 2 GB/sec, only half of the goal. I think the answer is that the buffers for each input or output 
port should interface a 64-bit bus with its own path to or from the L2. Each buffer needs to be able to handle 
out-of-order completion of reads, because some will be found in cache while others are in memory. 

Writing toward memory is easier because the data transfer will occur at a fixed time with respect to successful 
arbitration and address transfer, so the packet buffer can schedule each write as the last one is completing, and the 
path from a packet buffer to the cache data bus simply carries one doubleword per cycle. 

We will require receive packet payloads to be aligned with the memory at 32-byte boundaries, corresponding to 
a half line in the L2 cache. This eliminates the need to read a block from memory before writing the payload over 
it. 


5.3.5 Performance Counters 


It’s important to be able to measure and understand the performance of the communication fabric and to be 
able to characterize the load presented by an application. Some such data can be gathered by library software, 
noting initiation and completion times of messages, their lengths and other characteristics. But some information 
is undoubtedly best obtained by instrumenting the hardware. So far, I don’t know what to measure in the DMA 
engine. 


• For measurement of traffic through the switch, I’d like to have a performance sampling bit in the packet 
header, and accumulate a total of the time spent in the switch for all packets with the sampling bit set. An 
implementation of that function would attach a packet arrival time to every packet, and calculate the running 
sum of the difference between arrival and departure times for packets with the bit set. 


• I’d also be interested in knowing the number of cycles in which each virtual channel is blocked (has no 
apparently free buffers in the downstream node). 


• Packets received/transmitted per DMA engine port (by type?) 


• Memory ECC errors (count, or simply flag?) 
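
The switch-latency counter in the first item above can be modeled in a few lines. This is a behavioral sketch only, not a hardware design; the class name and the packet counter (added so that a mean can be computed) are assumptions that do not appear in the text.

```python
from dataclasses import dataclass


@dataclass
class SwitchSampler:
    total_cycles: int = 0     # running sum of (departure - arrival)
    sampled_packets: int = 0  # assumed extra counter, used for the mean

    def on_departure(self, arrival_cycle, departure_cycle, sampling_bit):
        """Accumulate time in the switch only for packets with the bit set."""
        if sampling_bit:
            self.total_cycles += departure_cycle - arrival_cycle
            self.sampled_packets += 1

    def mean_latency(self):
        """Average cycles in the switch per sampled packet (0.0 if none)."""
        if self.sampled_packets == 0:
            return 0.0
        return self.total_cycles / self.sampled_packets
```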


5.4 Registers and Definitions 


[$Id: DmaRegs.lyx 46805 2007-10-30 21:33:40Z denney $] 


5.5 Microengine Instructions 


5.5.1 Instruction Fields 


Class 
DmaUeInst 
Attributes 

Bits | Mnemonic | Type | Definition 
[illegible] | opaMode | DmaUeInstOpa | operand A addressing mode 
[row illegible] 
d0[11:9] | opbMode | DmaUeInstOpb | operand B addressing mode 
[row illegible] 
d0[20:18] | destMode | DmaUeInstDest | addressing mode for destination 
d0[26:21] | destIdx | | destination index 
[illegible] | alu | DmaUeInstAlu | ALU operation 
[row illegible] 
[illegible] | memWrAddr | | write MemAddr register for current thread 
d0[38:36] | memLenSel | | selects where the memory transfer length comes from, either a constant or from the payload length in a port. 
[illegible] | memLast | | For threads 0-7, memLast=1 means “This is the last instruction that refers to this packet.” If 1, the currently selected port is notified that the microengine is finished with the packet buffer. In the DMA_THR_IO_ACCESS thread, memLast=1 informs the StartIo interface that the microengine is ready for another I/O operation. For I/O reads this causes the cache interface to send the I/O data back to the processor. In the I/O thread, when memLast=1, the memOp must encode NONE, sleepMode must encode hwFlag, and sleepIndex must encode NONE. 
[illegible] | sleepMode | DmaUeInstSleep | sleep request 
d0[?:42] | sleepIndex | | which condition or mutex is indicated in sleep field. The encoding is DmaUeInstSleepCond or DmaUeInstSleepMutex, depending on sleepMode 
[row illegible] 
[illegible] | nextAddr | | next address 
[illegible] | stall | | number of cycles to delay before this thread may execute another instruction 
d0[63:0] | allBits | | for vspecs to read the first doubleword as a bit vector. Overlaps allowed. 


5.5.2 Operand A addressing modes 


This section describes the values that can go into the opaMode field of the microengine instruction. 
Enum 
DmaUeInstOpa 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
3'd0 | specialReg | Special Operand A registers, see table below 
3'd1 | ptr0 | Read from dmem. The 10-bit address is ptr0<9:0> xor (opaIdx<5:0> shifted left by 4). 
3'd2 | ptr1 | Read from dmem. Same as above, but with ptr1. 
3'd3 | ptr2 | Read from dmem. Same as above, but with ptr2. 
3'd4 | ptr3 | Read from dmem. The 10-bit address is ptr3<9:0> xor (opaIdx<5:0> shifted left by 4) xor processIndex<3:0>. For threads 0-3, the processIndex comes from bits 15:12 of the trailer FORD in the selected receive port. For the I/O thread, the processIndex<3:0> comes from the I/O address bits 19:16. 
3'd5 | ptr4 | Read from dmem. The 10-bit address is ptr4<9:0> xor (opaIdx<5:0> shifted left by 4) 
3'd6 | ptrd | Read from dmem. The 10-bit address is opaIdx<5:0> concatenated with 1111. 
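
The address arithmetic in the table above can be checked with a short model. This is a sketch for illustration only: the function name and the use of strings to name the modes are assumptions, and "ptrd" is used for the last mode because that is how the operand B table refers to it.

```python
def opa_dmem_address(mode, ptr, opa_idx, process_index=0):
    """Return the 10-bit dmem address for an operand A addressing mode."""
    idx = (opa_idx & 0x3F) << 4          # opaIdx<5:0> shifted left by 4
    if mode in ("ptr0", "ptr1", "ptr2", "ptr4"):
        return (ptr & 0x3FF) ^ idx
    if mode == "ptr3":
        # ptr3 additionally folds in processIndex<3:0>
        return (ptr & 0x3FF) ^ idx ^ (process_index & 0xF)
    if mode == "ptrd":
        # opaIdx<5:0> concatenated with binary 1111
        return idx | 0xF
    raise ValueError("mode does not read dmem: " + mode)
```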


5.5.3 Operand B addressing modes 


This section describes the values that can go into the opbMode field of the microengine instruction. 


Restriction: The dmem is divided into four banks, and each bank can only read one address at a time. If 
operand A and operand B select different addresses in the same dmem bank, operation is undefined. 
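
A microcode assembler could check this restriction before emitting an instruction. The sketch below assumes the bank number is taken from address bits 5:4, following the DMEM_INTERLEAVE_BIT constant (32'd4) in the constant definitions later in this chapter; the function names are hypothetical.

```python
DMEM_INTERLEAVE_BIT = 4  # banks interleave on this bit and the next one


def dmem_bank(addr):
    """Bank number (0-3) of a 10-bit dmem address (assumed bit mapping)."""
    return (addr >> DMEM_INTERLEAVE_BIT) & 0x3


def operands_conflict(addr_a, addr_b):
    """True if A and B read different addresses in the same bank."""
    return addr_a != addr_b and dmem_bank(addr_a) == dmem_bank(addr_b)
```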


Enum 
DmaUeInstOpb 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
3'd5 | ptr4 | Same as ptr4 in operand A except opbIdx is used. 
3'd6 | ptrd | Same as ptrd in operand A except opbIdx is used. 
3'd7 | memRead | Thread-specific buffer for microcode to read cache blocks. The buffer can be filled using the memory fields of the instruction. There are 16 doublewords of data in the buffer, selected by opaIdx<3:0>. 


5.5.4 Destination Addressing Modes 


Enum 
DmaUeInstDest 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
3'd5 | ptr4 | Same as ptr4 in operand A except destIdx is used 
3'd6 | ptrd | Same as ptrd in operand A except destIdx is used 
3'd7 | memWrite | Thread-specific buffer for microcode to write cache blocks, indexed by destIdx. The buffer can be sent to memory using the memory fields of the instruction. There are 16 doublewords of data in the buffer, selected by destIdx<3:0>. 


5.5.5 Special Registers addressed by Operand A 


Enum 
DmaUeInstSpecialOpa 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
[row illegible] 
6'h10 | thread0Ptr | Read THREAD0_PTR register, pointer state for the Rx port 0 thread. This is used to implement I/O reads of thread state registers. Also, it allows a thread to read its ptrN values. 
6'h17 | thread7Ptr | 
6'h18 | thread8Ptr | Read pointer state for Queue Manager 
6'h19 | thread9Ptr | Read pointer state for I/O service thread: THREAD9_PTR register 
[illegible] | spclData | Returns the 24-bit data from the most recent SPCL operation that arrived on the CSW. This register is only used in microcode that handles SPCLs. To compute spclData, concatenate 40 zeroes, ioAddr<35:20>, and ioAddr<15:8> to make a 64-bit value. 
6'h1F | ioAddr | Returns the address of the current I/O read or write. This is used in microcode that implements programmable I/O reads or writes. 
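
The spclData concatenation described in the table can be written out as follows. This is a minimal sketch of the bit manipulation only; the function name is hypothetical.

```python
def spcl_data(io_addr):
    """Build the 64-bit spclData value from a 36-bit I/O address.

    The result is 40 zero bits, then ioAddr<35:20>, then ioAddr<15:8>.
    """
    high = (io_addr >> 20) & 0xFFFF  # ioAddr<35:20>, 16 bits
    low = (io_addr >> 8) & 0xFF      # ioAddr<15:8>, 8 bits
    return (high << 8) | low         # 24 significant bits
```

Address bits 19:16 and 7:0 do not contribute, which is consistent with bits 19:16 carrying the process index for I/O space operations.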


5.5.6 Special Registers addressed by Operand B 


In this table, the constant represents the value to be used in opbIdx when the microinstruction reads special 
registers. To read a special register, set opbMode to SPECIAL_REG. The registers whose names start with “io” 
are used to implement I/O reads and should not be used in normal microcode. 


Enum 
DmaUeInstSpecialOpb 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
6'h00 | | The value 0. 
6'h1F | [illegible] | Returns the data of the current I/O write. This is used in microcode that implements programmable I/O writes. 
6'h20 | pktHead | Read the packet header FORD for the currently selected receive port. The receive port number is based on the thread number. 
6'h24 | pktCtl | Read the packet control FORD for the selected receive port. If there is no control FORD, according to the hasCtrl bit in pktHead, the pktCtl register retains its value from the last packet that did have a control FORD. 
6'h28 | pktTrail | Read the packet trailer FORD 
6'h2C | pktLen | Read the packet payload length for the selected receive port, in units of bytes. The payload length may be between 8 and 128 bytes, but always a multiple of 8. The header, control, or trailer FORDs are not counted as payload. 
6'h30 | pktPayload0 | Read the first doubleword of payload in the currently selected receive port. Continues until... 
6'h3F | pktPayload15 | Read the sixteenth doubleword of payload in the currently selected receive port. 


5.5.7 Special Registers addressed by Destination 


In this table, the constant represents the value to be used in destIdx when the microinstruction writes special 
registers. To write a special register, set destMode to SPECIAL_REG. The registers whose names start with “io” 
are used to implement I/O reads and should not be used in normal microcode. 

Enum 
DmaUeInstSpecialDest 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
6'h00 | trashcan | Null destination; used when an instruction does not write any result. 
6'h08 | ptr0 | Write ptr0 register in thread state from bits 9:0 of the ALU result. If the modified pointer is used in the next instruction, the instruction that writes the pointer must set Stall to at least 3 to avoid a pipeline hazard. 
[rows illegible] 
6'h0F | ioData | Write ALU result to the I/O read response buffer. This is used in microcode that implements a programmable I/O read. 
6'h20 | pktHead | Write packet header FORD for the currently selected transmit port. The transmit port number comes from the portSel field in thread state. The NumFords field of the header is calculated as the payload length in fords plus 2 (header and trailer) plus 1 if HasCtrl is set. The payload length must be set before writing the header. 
6'h24 | pktCtl | Write packet control FORD for the selected transmit port. The control FORD is only written to the fabric switch if the hasCtrl bit in pktHead is set. 
[illegible] | pktTrail128 | Write packet trailer FORD, setting the packet payload length to 128 bytes. 
[illegible] | pktTrailLen | Write packet trailer and payload length for the currently selected transmit port. The payload length is taken from bits 7:0 of the alu, must be between 8 and 128, and always a multiple of 8. If hasCtrl=0 in the header, payload length must be between 16 and 128 bytes. The header, control, or trailer FORDs are not counted as payload. The length must be set before writing the packet header. 
6'h30 | pktPayload0 | Write the first doubleword of payload in the currently selected transmit port. NOTE: When writing any of the pktPayloadN registers, you must ensure that any outstanding memory reads from the previous packet (before memLast) have completed; otherwise the current AND previous packet may be corrupted. See bug 2297 for further analysis. 
6'h3F | pktPayload15 | Write the sixteenth doubleword of payload in the currently selected transmit port. 
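
The payload length rules and the NumFords calculation stated for pktHead and pktTrailLen can be summarized in code. This is a sketch under one labeled assumption: a FORD is taken to be 8 bytes, which matches the 16 doublewords of payload in a 128-byte packet but is not stated outright in these tables. The function names are hypothetical.

```python
FORD_BYTES = 8  # assumption: one FORD holds one 8-byte doubleword


def payload_length_ok(n_bytes, has_ctrl):
    """Check the transmit payload length rules from the pktTrailLen row."""
    minimum = 8 if has_ctrl else 16
    return minimum <= n_bytes <= 128 and n_bytes % 8 == 0


def num_fords(n_bytes, has_ctrl):
    """NumFords = payload FORDs + header + trailer (+ control if present)."""
    if not payload_length_ok(n_bytes, has_ctrl):
        raise ValueError("illegal payload length")
    return n_bytes // FORD_BYTES + 2 + (1 if has_ctrl else 0)
```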


5.5.8 ALU Operation Field 


Enum 
DmaUeInstAlu 
Attributes 
-allowlc 

[header and leading rows illegible] 
5'd9 | priorityEncode | result<4:0> = priority encode of A<31:0>. The result is the bit number of the lowest bit of A that is set, or zero [text missing]. Upper result bits are 0. | result<4> | A<31:0> 
5'd10 | pidMatch | compare 16-bit value from bits 31:16 [remainder illegible] | 0 | [illegible] 
5'd12 | ptrUpdate | Pointer Update: memAddr = A<35:0>; Alu<35:0> = A<35:0> + Zext B<27:0>; Alu<63:36> = A<63:36> + B<27:0> | alu<63> | alu<63:36> 
5'd13 | ptrDist | Pointer Distance: Alu<35:0> = [remainder illegible] | alu<63> | alu<63:36> 
5'd14 | ptrExtend | Pointer Extend: Alu<35:0> = A<35:0> [remainder illegible] | alu<63> | alu<63:36> 
5'd15 | offset | calculate heap address and check offset. Alu<63:36> = A<63:36> + B<27:0>; Alu<35:0> = A<35:0> + Zext B<27:0> | alu<63> | alu<63:36> 
5'd16 | swapOffset | calculate heap address and check offset. Alu<63:36> = A<63:36> + B<59:32>; Alu<35:0> = A<35:0> + Zext B<59:32> | alu<63> | alu<63:36> 
5'd18 | subLow32 | Subtract in low 32 bits only. Result<31:0> = A<31:0> - B<31:0>; Result<63:32> = A<63:32>; N = Result<31>; Z based on Result<31:0> only 
5'd20 | cacheRead | result = B<63:0> [remainder illegible] 
5'd21 | cacheWrite | Write two dmem locations. B<63:0> is the alu result, and is written to destination address, which must have bit 4 = 1. MemAddr<35:0> is written to destination address minus one. | alu<63> | result<63:0> 
5'd23 | munge | Rotate/and/xor operations on Operand B, controlled by bits of Operand A. First rotate right by opa<5:0>. Any bits that shift off the end wrap around. Then AND with opa<39:8> (msb extended with opa<39>). Then XOR with opa<63:40> (msb extended with opa<63>). | B<63> | result<63:0> 
5'd24 | merge0 | Twice9 only: Merge A and B, based on bits from R_SDmaMergeOpHi[0] and R_SDmaMergeOpLo[0]. See 5.2.12.4 for details. | alu<63> | result<63:0> | TWC9 
5'd25 | merge1 | Twice9 only: Merge A and B, based on bits from R_SDmaMergeOpHi[1] and R_SDmaMergeOpLo[1]. | alu<63> | result<63:0> | TWC9 
5'd26 | merge2 | Twice9 only: Merge A and B, based on bits from R_SDmaMergeOpHi[2] and [remainder missing] | alu<63> | result<63:0> | TWC9 
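
The munge operation is the least obvious entry in the ALU table, so a bit-level model may help. This sketch follows the three steps in the munge description literally (rotate, AND, XOR); the helper names are hypothetical and the model ignores the condition flags.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, width):
    """Extend a `width`-bit field to 64 bits using its top bit."""
    value &= (1 << width) - 1
    sign = 1 << (width - 1)
    return ((value ^ sign) - sign) & MASK64


def munge(opa, opb):
    """Rotate B right by opa<5:0>, AND with opa<39:8>, XOR with opa<63:40>."""
    rot = opa & 0x3F
    rotated = ((opb >> rot) | (opb << (64 - rot))) & MASK64
    and_mask = sign_extend((opa >> 8) & 0xFFFFFFFF, 32)  # opa<39:8>
    xor_mask = sign_extend((opa >> 40) & 0xFFFFFF, 24)   # opa<63:40>
    return (rotated & and_mask) ^ xor_mask
```

For example, an operand A with opa<39:8> all ones, opa<63:40> zero, and opa<5:0> = 4 gives a plain rotate right by four bits.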


5.5.9 Memory Operation Field 


Enum 
DmaUeInstMemOp 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
3'b000 | none | Don’t start any memory operation 
3'b001 | memRead0 | Start block read from memory to the thread’s Memory Read Buffer, doublewords 0-7. The memory address comes from the MemAddr register for the thread. 
3'b010 | memRead1 | Start block read from memory to the thread’s Memory Read Buffer, doublewords 8-15. The memory address comes from the MemAddr register for the thread. 
3'b011 | | Start block read of currently selected packet buffer 
3'b100 | sendIntr | Send an interrupt to a processor. The instruction that sets sendIntr must produce an alu result in which result<15:12> is the bus stop number of the interrupt target and result<11:0> is the unique number that goes on CmdAddr. 
3'b101 | memWrite0 | Start block write from the thread’s Memory Write Buffer, doublewords 0-7, to memory. The memory address comes from the MemAddr register for the thread. 
3'b110 | memWrite1 | Start block write from the thread’s Memory Write Buffer, doublewords 8-15, to memory. The memory address comes from the MemAddr register for the thread. 
3'b111 | | Start block write from the currently selected packet buffer. 


5.5.10 Memory Transfer Length Selection 


Enum 
DmaUeInstMemLenSel 
Attributes 
-allowlc 


3’b000 payloadLen | Transfer length comes from the payload length of the port 
associated with the current thread. Only threads 0-7 may 
use this encoding. 

3’b001 bytes8 Use transfer length of 8 bytes. The hardware will transfer 
a whole cache block, but the thread may be able to awaken 
sooner than if it asked for a 64 byte transfer. 


5.5.11 Sleep Mode Field 


Enum 
DmaUeInstSleep 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
2'b00 | hwFlag | After this instruction, sleep until a certain hardware flag is detected, for example the completion of a memory transfer. The condition is determined by the Sleep Index field. 
2'b01 | | Reserved 
2'b10 | takeMutex | After this instruction, sleep until this thread has exclusive ownership of the mutex identified in the Sleep Index field. The following instruction is allowed to read/write the shared resource. 
2'b11 | dropMutex | After this instruction is completed, release the shared resource identified in the Sleep Index field. The instruction which specifies DropMutex is allowed to read/write the resource, but the following instruction must not. 


5.5.12 Sleep Index Field, when Sleep=HwFlag 


If the Sleep field equals HwFlag, the Sleep Index field is encoded as follows: 
Enum 
DmaUeInstSleepFlag 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
4'b0000 | none | Don’t sleep. For most instructions, you don’t want a sleep operation, so you should encode NONE. 
4'b0001 | halt | Halt is a sleep flag that is always false. If a process sleeps on this flag, it will never wake up. The only way a thread can awaken from halt is external software modification of the thread state. 
4'b0010 | buffer | After this instruction, sleep until there is a full (Rx) or empty (Tx) packet buffer from the selected port. The port that is monitored for packets is determined by the thread number. If SleepIndex=buffer is used in the same instruction as memLast=1, the stall field must contain at least 5 (?). 
4'b0011 | mem | After this instruction, sleep until all memory transfers started by previous microinstructions on this thread have completed. If SleepIndex=mem is used in the same instruction as starting a memory operation, the Stall field must contain at least 4. 
[illegible] | | Reserved 


5.5.13 Sleep Index Field, when Sleep=TakeMutex or DropMutex 


If the Sleep field is TakeMutex or DropMutex, the Sleep Index field tells which Mutex the instruction tries to 
acquire. The field is encoded as follows: 


Enum 
DmaUeInstSleepMutex 
Attributes 
-allowlc 

Constant | Mnemonic | Definition 
4'd0 | ptr0 | Use ptr0 bit 8 concatenated with bits 3:0 to select the mutex. 
4'd1 | ptr1 | Use ptr1 bit 8 concatenated with bits 3:0 to select the mutex. 
4'd2 | ptr2 | Use ptr2 bit 8 concatenated with bits 3:0 to select the mutex. 
4'd3 | ptr3 | Use ptr3 bit 8 concatenated with bits 3:0 of (ptr3 xor processIndex<3:0>) to select the mutex. See Operand A addressing modes table for details. 
4'd4 | ptr4 | Use ptr4 bit 8 concatenated with bits 3:0 to select the mutex. 
4'd5 | userdef0 | First of 10 user defined mutexes, available for microcode to use however it wants. 
[rows for the remaining user defined mutexes illegible] 
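
The mutex selection rules can be modeled as a function that returns a condition number. This is a sketch under two assumptions that the text does not confirm: the 5-bit value formed from pointer bit 8 and bits 3:0 is used directly as the mutex number (0 to 31), and user defined mutex N maps to 0x20 + N, as the MUTEX_USERDEF0 and MUTEX_USERDEF9 constants in the next section suggest. The function name and string selectors are hypothetical.

```python
def mutex_condition(selector, ptr=0, process_index=0, userdef=0):
    """Return the assumed sleep-condition number for a mutex selection."""
    if selector == "userdef":
        if not 0 <= userdef <= 9:
            raise ValueError("only userdef0 through userdef9 exist")
        return 0x20 + userdef
    if selector in ("ptr0", "ptr1", "ptr2", "ptr4"):
        low = ptr & 0xF
    elif selector == "ptr3":
        low = (ptr ^ process_index) & 0xF  # bits 3:0 of (ptr3 xor processIndex)
    else:
        raise ValueError("unknown selector: " + selector)
    return (((ptr >> 8) & 1) << 4) | low   # bit 8 concatenated with bits 3:0
```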


5.5.14 Internal Encoding of Sleep Conditions 


The sleep index field uses instruction bits plus parts of the thread state to select a particular hardware condition 
or mutex. Inside the DMA microengine, conditions and mutexes are treated almost the same. Conditions and 
mutexes resolve to a six-bit condition number that the thread selector can use to decide when to wake up a thread. 
The following table lists all the conditions that can cause a thread to sleep, and how they are encoded. The 
sleepCond register in the microengine (visible on R_SDmaSleepCondL and R_SDmaSleepCondH status registers) 
is a bit field whose bit numbers are defined by the Constant in this table. 


Enum 
DmaUeSleepCond 

Const | Mnemonic | Definition | (Controlled by) 
6'h00 | MUTEX0 | First of 32 mutexes selected by ptr0-ptr4 value. A one in this bit means that the mutex is available; zero means that the mutex is unavailable. | uCode 
6'h1F | MUTEX31 | Last of 32 mutexes selected by ptr0-ptr4 value. | uCode 
6'h20 | MUTEX_USERDEF0 | First of 10 mutexes selected by userdef0-userdef9 | uCode 
6'h29 | MUTEX_USERDEF9 | Last of 10 mutexes selected by userdef0-userdef9 | uCode 
6'h2A | MEMDONE_THR0 | All memory transfers started by thread 0 have completed. A zero in this bit means that thread 0 has started a transfer in the DMA cache interface which hasn’t completed. One means that all transfers started by this thread have finished. | 
6'h2B | MEMDONE_THR1 | | 
6'h2C | MEMDONE_THR2 | | 
6'h2D | MEMDONE_THR3 | | 
6'h2E | MEMDONE_THR4 | | 
6'h2F | MEMDONE_THR5 | | 
6'h30 | MEMDONE_THR6 | | 
6'h31 | MEMDONE_THR7 | | 
6'h32 | MEMDONE_THR8 | | 
6'h33 | MEMDONE_THR9 | | 
6'h34 | RX0_AVAIL | A new packet is available in receive port 0. If this bit is one, a packet has arrived in the receive port and is ready to be processed. If zero, the microengine must wait for a packet to arrive. | HW 
6'h35 | RX1_AVAIL | A new packet is available in receive port 1 | 
6'h36 | RX2_AVAIL | A new packet is available in receive port 2 | HW 
6'h37 | RX_COPY_AVAIL | A new packet is available in the receive side of the copy port | 
6'h38 | TX0_AVAIL | Empty packet buffer is available in transmit port 0. If this bit is one, the transmit port is ready for the microengine to send a packet; if zero, the microengine must wait before sending a transmit packet. | 
6'h39 | TX1_AVAIL | Empty packet buffer is available in transmit port 1 | 
6'h3A | TX2_AVAIL | Empty packet buffer is available in transmit port 2 | 
6'h3B | TX_COPY_AVAIL | Empty packet buffer is available in the transmit side of the copy port | 
[row illegible] 
6'h3D | IO_THREAD_AWAKE | Used to awaken the I/O processing thread during I/O | 
6'h3E | HALT | HALT is a sleep condition that is always false. If a thread sleeps on this condition, it will never wake up. | Constant 
6'h3F | NONE | In an instruction, this value means “don’t sleep.” In the sleepCond field of a thread’s state, it means “I’m not sleeping.” | Constant 
6'h00 | FIRST_MUTEX | the first entries are hardware flags. Which is the first entry that is a mutex? Update if encoding table changes. | 
6'h29 | LAST_MUTEX | | 
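
The way the thread selector might use these condition numbers can be sketched as a simple bit test. This is an illustration under an assumption: the 64 condition bits are treated as a single integer, for example formed as (high register << 32) | low register from the two status registers, which the text does not spell out. The function name is hypothetical.

```python
COND_HALT = 0x3E  # always false: a thread sleeping on it never wakes
COND_NONE = 0x3F  # "don't sleep" / "I'm not sleeping"


def thread_may_run(sleep_cond_bits, cond):
    """True if a thread sleeping on condition `cond` is allowed to run."""
    if cond == COND_NONE:
        return True
    if cond == COND_HALT:
        return False
    # For every other condition a one bit means available, done, or ready.
    return bool((sleep_cond_bits >> cond) & 1)
```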


5.5.15 Branch Field 




Enum 
DmaUeInstBranch 
Attributes 
-allowlc 

[header and leading rows illegible] 
3'b011 | NZ | NextAddr<0> = ALU Z [remainder illegible] 
[illegible] | | Reserved 


5.5.16 Dedicated Microinstruction Addresses 


I/O space operations make address bits 19:16 available as the process index. (See ptr3 definition in Operand A 
addressing.) 


Address bits 6:3 are ANDed with a 4-bit kernel-programmable mask to produce microinstruction address bits 
3:0. (See PROG_IO register) 


Microinstruction address bits 7:4 are 0 for I/O writes, 1 for I/O reads, and 2 for SPCL writes. 
Defines 
DMA_UINST_ADDR 

Constant | Mnemonic | Definition 
10'h00 | PROG_IO_WRITE | For programmable I/O writes, execute microcode at this address plus the I/O write address bits 6:3 
10'h10 | PROG_IO_READ | For programmable I/O reads, execute microcode at this address plus the I/O read address bits 6:3 
10'h20 | PROG_IO_SPCL | For programmable SPCLs, execute microcode at this address plus the SPCL address bits 6:3 
10'h30 | DEFAULT_ENTRY_THR0 | First instruction executed by thread 0 
10'h31 | DEFAULT_ENTRY_THR1 | First instruction executed by thread 1 
10'h39 | DEFAULT_ENTRY_THR9 | First instruction executed by thread 9 
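
The entry-point calculation for programmable I/O can be sketched as follows. The constants come from the table above; the function names are hypothetical, and `mask` stands for the 4-bit kernel-programmable mask from the PROG_IO register.

```python
PROG_IO_WRITE = 0x00
PROG_IO_READ = 0x10
PROG_IO_SPCL = 0x20


def prog_io_entry(base, io_addr, mask):
    """Microinstruction address: base plus (address bits 6:3 AND mask)."""
    return base | ((io_addr >> 3) & 0xF & mask)


def io_process_index(io_addr):
    """Process index taken from I/O address bits 19:16."""
    return (io_addr >> 16) & 0xF
```

With a mask of 0xF each of the sixteen doubleword offsets gets its own entry point; a narrower mask folds several offsets onto the same microcode.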


5.5.17 Miscellaneous Constant Definitions 


Defines 
DMA 

Const | Mnemonic | Definition 
32'd4 | PBUF_N | Number of packet buffers in a receive or transmit port. PBUF_N = 1 << PBUF_N_LOG_2 
32'd2 | PBUF_N_LOG_2 | How many bits are required to encode the packet buffer number PBUF_N? 
32'd3 | PBUF_BUF_MASK | Bitmask used for selecting the buffer number in the low bits of the PBUF address. PBUF_MASK = PBUF_N - 1 
32'd64 | PBUF_WORDS | Number of words in a receive port packet buffer 
32'd72 | PBUF_BITS | Number of bits in packet buffer 
32'd72 | OPRF_BITS | Number of bits in receive port operand regfile. 
32'd44 | OPRF_WORDS | Number of words in a receive port operand regfile. 32 words of fabric switch control/status registers + 3 regs * 4 packet buffers = 44. The three regs are pktHead, pktCtl, and pktTrail. 
32'd5 | N_OPERAND_PTRS | Number of pointers in operand A, B, and destination 
32'd72 | DMEM_BITS | Number of bits in microengine data memory 
32'd1024 | DMEM_WORDS | Number of words in microengine data memory. It’s split into two halves, each DMEM_WORDS/2. 
32'd4 | DMEM_INTERLEAVE_BIT | Which bits of the DMEM address determine interleaving of data between the four banks. The banks are interleaved on bits DMEM_INTERLEAVE_BIT and DMEM_INTERLEAVE_BIT+1. 
 | DMEM_PROCESS_INCR | Add this to the DMEM address to find the next process. The address for process descriptor P would be DMEM_PROCESS0 + P * DMEM_PROCESS_INCR. 
32'd64 | UIM_BITS | Number of bits in microinstruction memory 
32'd1024 | UIM_WORDS | Number of words in microinstruction memory 
32'd10 | UIM_ADDR_BITS | Number of bits needed to specify an address in the microinstruction memory. 1<<UIM_ADDR_BITS = UIM_WORDS. 
32'd10 | N_THREADS | Number of threads in microengine 
32'd2 | MAX_TASKS_PER_THREAD | Number of cache interface operations per thread 
32'd4 | OUTSTANDING_READS | ICE9 only: Maximum number of outstanding reads from [remainder illegible] 
32'd4 | OUTSTANDING_WRITES | ICE9 only: Maximum number of outstanding writes from [remainder illegible] 
32'd7 | OUTSTANDING_READS_TWC | TWC9 only: Maximum number of outstanding reads from [remainder illegible] 
32'd7 | OUTSTANDING_WRITES_TWC | TWC9 only: Maximum number of outstanding writes [remainder illegible] 
32'd4 | NUM_MEMOUT_SEQ | Number of MemOut address sequencers. There are four sequencers, one for each of: rxp0, rxp1, rxp2, and copy ports. 
32'd4 | NUM_MEMIN_SEQ | Number of MemIn address sequencers. There are four sequencers, one for each of: txp0, txp1, txp2, and copy ports. 
8'hA0 | RMB_IO_CORE0 | address in memory read buffer (RMB) where I/O write data from core 0 is stored 
[row illegible] 
8'h08 | RMB_IO_ADDR_INCR | distance between I/O data addresses. Use RMB_IO_CORE0 + N * RMB_IO_ADDR_INCR 
8'hA0 | WMB_IO_CORE0 | address in memory write buffer (WMB) where I/O read data for core 0 is stored 
8'hA8 | WMB_IO_CORE1 | addr in WMB where data for core 1 is stored 
8'h08 | WMB_IO_ADDR_INCR | distance between I/O data addresses. Use WMB_IO_CORE0 + N * WMB_IO_ADDR_INCR 
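
Several of these constants are related by formulas given in their definitions, and a short script can confirm that the listed values agree. This is only a consistency check over values copied from the table; the helper function name is hypothetical.

```python
PBUF_N_LOG_2 = 2
PBUF_N = 4
PBUF_BUF_MASK = 3
OPRF_WORDS = 44
UIM_WORDS = 1024
UIM_ADDR_BITS = 10
WMB_IO_CORE0 = 0xA0
WMB_IO_CORE1 = 0xA8
WMB_IO_ADDR_INCR = 0x08


def wmb_io_addr(core):
    """WMB address of the I/O read data for core N."""
    return WMB_IO_CORE0 + core * WMB_IO_ADDR_INCR


# Relations stated in the definitions column.
consistent = (
    PBUF_N == 1 << PBUF_N_LOG_2
    and PBUF_BUF_MASK == PBUF_N - 1
    and OPRF_WORDS == 32 + 3 * PBUF_N
    and UIM_WORDS == 1 << UIM_ADDR_BITS
    and wmb_io_addr(1) == WMB_IO_CORE1
)
```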


5.5.18 DMA Thread Numbers 


This table shows what tasks are assigned to the DMA microengine threads. 
Defines 
DMA_THR 

Constant | Mnemonic | Definition 
32'd0 | | Thread that services receive port 0 
32'd1 | | Thread that services receive port 1 
32'd2 | | Thread that services receive port 2 
32'd3 | COPY_RX | Thread that services the receive side of the copy port 
32'd4 | | Thread that services transmit port 0 


5.5.19 DMA Port numbers 


Enum 
DmaPort 


Receive port 2 control registers (read only) 
Copy port memories, transmit side 


5.5.20 DMA Queue numbers 


These constants are chosen to match the order in the Common Control/Status (Kernel R/W) table. If that 
table is converted to a form that vspecs can read, then DmaQueue is redundant and should be removed. 

Enum 

DmaQueue 


Receive port 0 queue 


| 4d6 | 
Copy port memories, transmit side 


5.5.21 DMA Internal Memory Addresses 


Class 
DmaInternalAddr 
Attributes 

Bits | Mnemonic | Type | Definition 
w0[5:0] | mem | DmaInternalMem | The mem field tells which of the DMA’s memories is selected. 
w0[15:6] | index | | The index field tells the address in the selected memory. 
w0[15:0] | allBits | | for reading the whole structure as a single bit vector. Overlaps allowed. 
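
The field layout above amounts to a 6-bit memory selector below a 10-bit index. A minimal sketch of packing and unpacking such a word follows; the function names are hypothetical.

```python
def pack_internal_addr(mem, index):
    """Build the 16-bit word: index in bits 15:6, mem in bits 5:0."""
    if not 0 <= mem < (1 << 6):
        raise ValueError("mem must fit in 6 bits")
    if not 0 <= index < (1 << 10):
        raise ValueError("index must fit in 10 bits")
    return (index << 6) | mem


def unpack_internal_addr(word):
    """Split a 16-bit word into (mem, index)."""
    return word & 0x3F, (word >> 6) & 0x3FF
```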


5.5.22 DMA Internal Memory Addresses (Mem Field) 


This table creates an encoding for every memory in the DMA engine. The encodings are used for several 
different purposes, including operand and destination selection and memory<=>cache transfers. These values are 
useful to circuit implementors but not to programmers. 

It is important that the memories addressed by the cache interface are grouped together so that some number 
of low bits of the constant can distinguish them. 

Enum 
DmaInternalMem 

Constant | Mnemonic | Definition 
[leading rows illegible] 
6'h30 | RX0_PBUF | RX port 0 packet buffers. NOTE: all packet buffers are 0x30 to 0x8F. Circuits that only refer to packet buffers don’t need to store all 6 bits. They can just use values like RX2_PBUF - RX0_PBUF and store only 4 bits. Also, it’s important that the first 8 packet buffers starting with RX0_PBUF are in thread order. 
[rows illegible] 
6'h38 | RMB0 | Read memory buffer 0, in copy port. In fact RMB0 and RMB1 are two adjacent regions in the same memory. 
6'h39 | RMB1 | read memory buffer 1, in copy port 
6'h3A | WMB0 | 
6'h3B | WMB1 | write memory buffer 1, in copy port 


5.5.23 Receive Port Buffer State Machine 


Enum 
DmaRxpState 

Constant | Mnemonic | Definition 
2'b00 | ST_SWRX | transfer from switch pending 
2'b01 | ST_WAITUE | tx from switch done. wait to enter ST_UE. 
2'b11 | ST_UE | selected for microengine operations 
2'b10 | ST_CA | cache operation pending 


5.5.24 Receive Port CMUX Select Values 


Enum 
DmaRxpCmuxSel 

Constant | Mnemonic | Definition 
4'b0000 | NONE | select nothing. cmux will output all zeroes. 
4'b0100 | RXP0 | select data from receive port 0 
4'b0101 | RXP1 | select data from receive port 1. 
4'b0110 | RXP2 | select data from receive port 2. 
[illegible] | COPY | 
4'b0111 | UNIT_SEL_MASK | bits 2,1,0 indicate which unit is selected. 
4'b1000 | ENABLE_ODD_WORD | bit 3=1 enables the odd word. bit 3=0 clears the odd word. 


5.5.25 Transmit Port Buffer State Machine 


Enum 
DmaTxpState 


Constant | Mnemonic 
2’b00 IDLE tx from switch done. wait to enter ST_UE. 


2’b01 selected for microengine operations 
PII 
2’b10 SWTX transfer from switch pending 


5.5.26 ‘Transmit Port: Packet Builder State Machine 


Enum 
DmaTxpBldPktState 


SEND_TRAILER 


5.5.27 Copy Port Buffer State Machine 


Enum 
DmaCopyState 


Definition 


waiting to enter UERX 
UERX selected for microengine operations in Copy RX thread 
WRMEM | write operations pending in DmaCif 


5.5.28 Copy Port: Read/Write Memory Buffer Address 
Class 
DmaCopyMbAddr 
Attributes 

Bits | Mnemonic | Definition 
w0[7:4] | thread | Thread number, 0-9. Also, I/O reads and writes use these buffers to store data being read or written, by setting thread to [illegible]. 
[illegible] | [illegible] | low 3 bits select which doubleword within a cache block 
w0[7:0] | allBits | for reading the whole field as one bit vector. Overlaps allowed. 


5.5.29 Dma Cache Interface Task 


used in ReadWriteQ, ReadWriteExtQ, Outstanding WriteTable 
Class 
DmaCifTask 
Attributes 

Bits | Mnemonic | Definition 
[illegible] | [illegible] | microengine thread number 
w0[7:4] | localTarget | which internal DMA unit will be accessed. Encoding is [illegible] 
[illegible] | [illegible] | number of doublewords remaining to be transferred (0-10) 
w0[20:13] | localAddr | address in local DMA memory, 8 bits 
w0[53:21] | memAddr | address in main memory. There are 33 bits for Address<35:3>. 
w0[56:54] | type | what command to send to CSW? In block read and write queues, only BRD or BWT will appear. 
[illegible] | firstBlock32Byte | 1=This is the first cache block transfer in a transfer that starts on a half-cache-block boundary. 0=any subsequent blocks 
[illegible] | swapEvenOdd | is the memory address aligned to a doubleword or not? if 0, it is aligned. if 1, enable 32-bit swap. 
w0[59] | valid | is this a valid task or just a No-op? 1=valid task. 0=no operation. ignore all the other bits in this task. 
w0[63:0] | allBits | for reading the whole field as one bit vector. Overlaps allowed. 

5.5.30 Dma Cache Interface: Memory Operation Type 


These encodings are used on the ue_cif_TaskType_c5a bus. 
Enum 


DmaUeMemOpType 
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4’d0 Start write operation from an internal DMA memory to 
the L2. The length of transfer comes from the payload 
length register in the port associated with this thread. 
4’d1 


(only threads 0-7) 

Start read operation from the L2 to an internal DMA 
memory. The length of transfer comes from the payload 
length register in the port associated with this thread. 
(only threads 0-7) 

Same as BWT except the length is 8 bytes. The hardware 
will transfer a whole cache block, but the thread may 
be able to awaken sooner than if it asked for a 64 byte 
transfer. 

Same as BRD except the length is 8 bytes. The hardware 
will transfer a whole cache block, but the thread may 
be able to awaken sooner than if it asked for a 64 byte 
transfer. 


Reserved
4’d13 IORD Response to I/O read from a core. Drive Data only. This
memory op does not increment the thread counter. 


4’d14 SPCL Response to a SPCL from a core. Drive DONE command 
onto CmdAddr bus. This memory op does not increment 
the thread counter. 

4’d15 INTR Send an interrupt. The bus stop number will be on
alu_cif_MemAddr<15:12> and the unique id will be on
alu_cif_MemAddr<11:0>. This memory op does not
increment the thread counter.


5.5.31 Dma Cache Interface: Type of Task 


These encodings are used for the type field in the Task data structures. 


Enum 


DmaCifTaskType

3’b000 BWT Start block write to coherence controller. Drive BWT on
CmdAddr.


BRD Start block read. Drive RDS (????) command on
CmdAddr, then wait for Data to arrive.


PRBDONE End of block read protocol. After data arrives, only if
????, send a PRBDONE on CmdAddr to notify COH that
read is complete.


When WRIO arrives from a core, send RDIO on
CmdAddr as a response.


IORD Response to I/O read from a core. Drive Data only.


Send an interrupt to a processor. See “sendIntr” in the
Raa field for more details.


BRDR Block read retry


SPCL SPCL command, a CmdAddr-only command that triggers
a programmable I/O operation. When SPCL is in
the StartIoQ it causes the microengine to execute an I/O
operation. When SPCL is in the WriteQ it causes the
DmaCif to send a DONE command back to the processor.


5.5.32 Dma Cache Interface: Numbering of Queues 


Enum 

DmaCifQueueNum 

data write queue 

1-deep skid buffer, holds a dequeued data task during stall cycles 


5.5.33 Dma Cache Interface: Depth of Queues for ICE9 


Defines 
DMA_QUEUE_SIZE 
data response queue 


data response queue
1-deep skid buffer, holds a dequeued data task during stall cycles


5.5.34 Dma Cache Interface: Depth of Queues for TWC9


Defines 
DMA_QUEUE_SIZE_TWC 


data write queue 
1-deep skid buffer, holds a dequeued data task during stall cycles 
StartIo queue for I/O reads


5.5.35 Dma Cache Interface: Outstanding Read Table entry 


Class 
DmaCifOrtEntry 
Attributes 


w0[0] valid this table entry is valid
swapEvenOdd 1=use 32-bit alignment


align Alignment information for transmit buffer.
MemAddr<5:3> is stored here so that when we
send the packet to the FSW, the TX port knows how to
align the data.


w0[12:5] localAddr address in local DMA memory


w0[45:13] memAddr so that we know the address for BRDR and PRBDONE.
We know there is duplication between align
and memAddr, but we’re leaving it because we think
memAddr can be eliminated.


w0[49:46] localTarget which internal DMA unit will be accessed? The encoding
is the low 4 bits of DmaInternalMem.


w0[53:50] thread DMA thread number, needed for thread accounting


w0[63:0] allBits for reading the whole field as one bit vector. Overlaps 
allowed. 


5.5.36 Dma Cache Interface: Outstanding Write Table entry 


The OWT data is encoded using the DmaCifTask data structure. 


5.5.37 Dma Cache Interface: Block Read Retry Queue (BrdrQ) for ICE9 


Class 

DmaCifProtocolEntry 

w0[4:0] tid Transaction id bits
w0[8:5] dest L2 bus stop number of the block that we will write to
w0[9] valid this table entry is valid


w0[63:0] allBits for reading the whole field as one bit vector. Overlaps 
allowed.


5.5.38 Dma Cache Interface: Block Read Retry Queue (BrdrQ) for TWC9


Class
DmaTwcCifProtocolEntry


w0[5:0] tid TWC9A Transaction id bits
w0[9:6] dest TWC9A L2 bus stop number of the block that we will write to
w0[10] valid TWC9A this table entry is valid


w0[63:0] allBits TWC9A for reading the whole field as one bit vector. Overlaps
allowed.


5.5.39 Dma Cache Interface: Command RDIO Queue (CrdioQ) 


This queue is encoded with DmaCifProtocolEntry. 


5.5.40 Dma Cache Interface: SPCL/INT Queue (CSpclIntQ) for ICE9


Class 

DmaCifSpclIntEntry 

Definition

w0[4:0] tid Transaction id bits (for SPCL only)

w0[11:0] intReason Interrupt reason (for INT only). Overlaps tid.

w0[15:12] dest L2 bus stop number of the block that we will write to
w0[16] isSpcl which type of command is this? 1=SPCL, 0=INT
w0[20:17] thread Thread number (for INT only)
w0[21] valid this table entry is valid


w0[63:0] allBits for reading the whole field as one bit vector. Overlaps 
allowed. 


5.5.41 Dma Cache Interface: SPCL/INT Queue (CSpclIntQ) for TWC9 


Class 
DmaTwcCifSpclIntEntry 


w0[5:0] tid TWC9A Transaction id bits (for SPCL only)
w0[11:0] intReason TWC9A Interrupt reason (for INT only). Overlaps tid.
w0[15:12] dest TWC9A L2 bus stop number of the block that we will write to


w0[16] isSpcl TWC9A which type of command is this? 1=SPCL, 0=INT
w0[20:17] thread TWC9A Thread number (for INT only)
w0[21] valid TWC9A this table entry is valid


5.5.42 Dma Cache Interface: Data Response Queue (DataRspQ) 


This queue is encoded with DmaCifProtocolEntry. 


5.5.43 Dma Cache Interface: Data Write Queue (DWQ) 


This queue is encoded with DmaCifProtocolEntry. 


5.5.44 Dma Cache Interface: I/O Read Queue (DRDIOQ) 


This queue is encoded with DmaCifProtocolEntry.


5.5.45 Dma Cache Interface: StartIoQ for ICE9


Class


DmaCifStartIoEntry

Attributes

d0[1:0] type DmaCifStartIoType RDIO or WTIO or SPCL
d0[34:2] ioAddr 33 bits corresponding to csw_dma_Addr<35:3>


d0[39:35] tid L2 transaction id for this I/O operation
d0[43:40] origin bus stop number of originator
d0[63:0] allBits for reading all bits at once. Overlaps allowed.


5.5.46 Dma Cache Interface: StartIoQ for TWC9


Class

DmaTwcCifStartIoEntry

Attributes

Definition

d0[1:0] type TWC9A DmaCifStartIoType RDIO or WTIO or SPCL

d0[34:2] ioAddr TWC9A 33 bits corresponding to csw_dma_Addr<35:3>
d0[40:35] tid TWC9A L2 transaction id for this I/O operation
d0[44:41] origin TWC9A bus stop number of originator


d0[63:0] allBits TWC9A for reading all bits at once. Overlaps allowed.


5.5.47 Dma Cache Interface: StartIoType


These encodings are used for the type field in the StartIo data structure.


Enum


DmaCifStartIoType
I/O operation is a write


2’b10 SPCL I/O operation is a special (one way message from core to
DMA)


2’b11 reserved


5.5.48 Dma Cache Interface: Address memory entry 


Class 

DmaCifAdmEntry 

Attributes 

memAddr address in main memory
len number of doublewords to transfer


d0[63:0] allBits for reading the whole field as one bit vector. Overlaps
allowed. 


5.5.49 Dma Cache Interface: MemOut Address Sequencer States 


Enum 
DmaCifMoaState


3’b000 IDLE


5.5.50 Dma Cache Interface: MemIn Address Sequencer States 


Enum 
DmaCifMiaState 


5.5.51 Internal Encodings for Microengine Operands 


These values are used within the microengine on signals ue_xxx_OpaAddr_c3a, ue_xxx_OpbAddr_c3a, and 
ue_xxx_ResultAddr_c5a. Because many of the things the microengine can address are accessible from I/O, we’re 
using I/O addresses even for some of the things that are internal. 

Defines 

DMA_OP_ENC 


24’h321312 SPCL_DATA spclData register


5.5.52 I/O Region Type (DmaIoRegionType)


This data type describes regions of I/O addresses in the table above.
Enum
DmaIoRegionType


3’b101 FIXED_RW_OPA Region is readable and writable by fixed I/O. Reads use
operand A.


3’b011 FIXED_RW_OPB Region is readable and writable by fixed I/O. Reads use
operand B.


3’b111 NONE not a valid region type 


5.5.53 External I/O Addresses 


Assume the DMA engine I/O space starts at DMA_IO_BASE. Everything else is specified as an offset relative 
to DMA_IO_BASE. 

Defines 

DMA_IO

36’hE_8100_0000 BASE Start of DMA engine’s I/O space
36’hE_843F_FFFF End of DMA engine’s I/O space


24’h010000 PAGE_SIZE These addresses are calculated based on a page size of
64 KB = 0x10000 bytes. As of 5/16/2005 that was our
best guess.


reserved


24’h321000 FIRST_UE_REG Address of first DMA register whose value lives in the
microengine module DmaUe


24’h321300 Reserved for internal encodings. See the table
DMA_OP_ENC for details.


5.6 Registers Accessible by RDIO/WTIO from Processors 


5.6.1 DMA Instruction Memory (IMEM) 


Every location in the DMA instruction memory is I/O accessible. At node initialization time, every location
must be initialized to a known value, to ensure repeatable results and to avoid false detection of ECC errors. The
IMEM may only be accessed when every DMA thread is disabled (see R_DmaThreadSel).


Register

R_DmaImem[1023:0]

Address 

0xE_8131_0000-0xE_8131_1FFF (Add 0x8 per entry) 
Attributes 

-kernel 


63:0 Instr RW X Allows read/write access to one word of
IMEM.
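The "Add 0x8 per entry" rule for the 1024-entry IMEM array can be sketched as follows. This is only an illustration of the address arithmetic stated above; the function name `imem_entry_address` is hypothetical.

```python
IMEM_BASE = 0xE_8131_0000   # first byte of R_DmaImem[0], from the table above
IMEM_ENTRIES = 1024         # R_DmaImem[1023:0]
ENTRY_STRIDE = 0x8          # "Add 0x8 per entry"


def imem_entry_address(index):
    """Return the I/O address of R_DmaImem[index]."""
    if not 0 <= index < IMEM_ENTRIES:
        raise IndexError("IMEM index out of range: %d" % index)
    return IMEM_BASE + ENTRY_STRIDE * index


print(hex(imem_entry_address(0)))
print(hex(imem_entry_address(1023)))
```

The last entry starts at 0xE_8131_1FF8, so its eight bytes end at 0xE_8131_1FFF, which matches the upper bound of the documented range.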


5.6.2 DMA Data Memory (DMEM) 


Every location in the DMA data memory is I/O accessible. At node initialization time, every location must be
initialized to a known value, to ensure repeatable results and to avoid false detection of ECC errors. Usually the
processors will not access Dmem while the microengine is running, but it is perfectly legal to do so.


Register

R_DmaDmem[1023:0]

Address 

0xE_8130_0000-0xE_8130_1FFF (Add 0x8 per entry) 
Attributes 

-kernel 


63:0 Data RW X Allows read/write access to one word of
DMEM.


5.6.3 DMA Thread Select Register 


Register 
R_DmaThreadSel 
Attributes 
-kernel 

Address 
0xE_8132_1100


threadEnable RWS The thread enable bits allow external software
to control which threads execute and which do 
not. When the bit corresponding to a thread 
is 1, the thread is allowed to issue instructions, 
subject to the countdown behavior. When 0, 
the thread may not execute any instructions. 
The threadEnable bit corresponding to the 
I/O thread is ignored because the I/O thread 
cannot be disabled. 


31:16 countdown This 16-bit counter allows software to ask
the DMA engine to execute N instructions
and then halt. When countdownHalt=1, the
counter decrements as each microinstruction
is issued, but when it reaches zero, all threads
(except for the I/O thread) stop issuing
instructions until software intervenes.

countdownHalt RW When 1, enable countdown-and-halt behavior
described above. When 0, disable
countdown-and-halt behavior.


Cautionary Note: ThreadEnable bits must be used with caution: any thread can take a mutex flag which may
be needed by the I/O thread in order to service a read, write, or spcl request (that is, requests to DmaAppIface0 or
DmaAppIface1). If a processor issues such a request while a stopped thread is holding such a mutex, the processor
will be hung and must be reset to recover.

In current microcode (as of March 2006), only writes of eventQRdSize depend on a mutex. 


5.6.4 DMA Thread Pointer Registers 


This table describes the thread pointer registers. There are 10 in all, one for each DMA microengine thread. 
Register 

R_DmaThreadPtr[9:0]

Address 

0xE_8132_1000-0xE_8132_104F (Add 0x8 per entry) 

Attributes 

-kernel 


d0[19:10] RW 0 Pointer into dmem


Pointer into dmem
d0[39:30] RW 0 Pointer into dmem
RW 0 Pointer into dmem


5.6.5 DMA Thread Program Counter Registers 


This table describes the thread PC registers. There are 9 in all, one for each DMA microengine thread except
for the I/O thread #9. The I/O thread has internal registers for pc and sleepCond, but they are not visible to
software because the act of reading or writing an I/O register affects the I/O thread’s values.

Register 

R_DmaThreadPc[8:0] 

Address 

0xE_8132_1080-0xE_8132_10C7 (Add 0x8 per entry) 

Attributes 

-kernel


Definition


pc Program counter for the thread. The pc tells
what address in instruction memory to read.


sleepCond DmaUeSleepCond This field indicates whether the thread is waiting
for a condition to become true. If sleepCond
is set to DmaUeSleepCond_NONE, the
thread is NOT waiting for any condition;
otherwise the field encodes which condition it is
waiting for.


5.6.6 DMA Programmable I/O Control Register 


Register 
R_DmaProgIo
Attributes
-kernel
Address
0xE_8132_1108


3:0 ioAddrMask RW 0xf For programmable I/O operations, the ioAddrMask bits
are ANDed with the I/O address bits when generating the
microinstruction address to execute.


5.6.7 DMA Application Interface Region 0 


This is an address range in which loads and stores cause the DMA to execute microcode.
Register

R_DmaAppIface0[0x1FFFF:0]

Address

0xE_8110_0000-0xE_811F_FFF8 (Add 0x8 per entry)

Attributes 

-noregtest -kernel 


Definition 


63:0 Data RW X Programmable I/O region 0. A load or store to this
address range in a processor causes a RDIO or WTIO command
on the CSW, which triggers a sequence of microcode
in the DMA engine. WTIO to address X causes the
microengine to execute instructions starting at IMEM address
DMA_UINST_ADDR_PROG_IO_WRITE + (X[6:3] & ioAddrMask).
RDIO from address X causes the I/O thread in the
microengine to execute instructions starting at IMEM address
DMA_UINST_ADDR_PROG_IO_READ + (X[6:3] & ioAddrMask).
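The start-address rule above, base + (X[6:3] & ioAddrMask), can be illustrated with a short sketch. The function name `prog_io_uinst_addr` is hypothetical, and because this section does not give numeric values for DMA_UINST_ADDR_PROG_IO_WRITE or DMA_UINST_ADDR_PROG_IO_READ, the base is passed in as a parameter; the value 0x100 used below is a placeholder, not the real constant.

```python
def prog_io_uinst_addr(io_addr, io_addr_mask, base):
    """Return the IMEM start address for a programmable I/O access.

    io_addr      -- the I/O address X of the load or store
    io_addr_mask -- the 4-bit ioAddrMask field of R_DmaProgIo
    base         -- the microcode base for this kind of access (placeholder)
    """
    x_6_3 = (io_addr >> 3) & 0xF          # X[6:3]
    return base + (x_6_3 & io_addr_mask & 0xF)


PLACEHOLDER_BASE = 0x100  # assumed value, for illustration only

# With the reset mask 0xf, each of 16 doubleword offsets gets its own entry.
print(hex(prog_io_uinst_addr(0xE_8110_0028, 0xF, PLACEHOLDER_BASE)))
# Narrowing the mask folds several offsets onto the same microcode entry.
print(hex(prog_io_uinst_addr(0xE_8110_0028, 0x3, PLACEHOLDER_BASE)))
```

Under these assumptions a narrower ioAddrMask makes several doubleword offsets share one microcode entry point.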


5.6.8 DMA Application Interface Region 1 


This is an address range in which stores cause the DMA to execute microcode. 
Register 

R_DmaAppIface1[0x1FFFF:0]

Address 

0xE_8120_0000-0xE_812F_FFF8 (Add 0x8 per entry)

Attributes 

-noregtest -kernel


63:0 Data W X Programmable I/O region 1. A store to this address
range in a processor causes a SPCL command on the
CSW, which triggers a sequence of microcode in the
DMA engine. SPCL to address X causes the microengine
to execute instructions starting at IMEM address
DMA_UINST_ADDR_PROG_IO_SPCL + (X[6:3] &
ioAddrMask).


5.7 Registers Accessible by Serial Configuration Bus 


The DMA block has registers accessible by RDIO/WTIO and others accessible by the SCB. All SCB registers
have the prefix “R_SDma” to indicate that they are on the SCB.


5.7.0.1 Block Reset Register 


This register allows the RX/TX ports of the DMA to be reset individually. Each port has an active-high signal 
which forces everything back to its reset state. After the DMA block is reset, the ports remain in reset until software 
initializes the DMA and decides to allow packets to flow. This ensures that an unconfigured DMA cannot cause 
the fabric to back up. 


Register 

R_SDmaBlockReset 

Attributes 

-kernel 

Address 

5:3 TxReset One bit per transmit port. Bit 3+N affects TX port N.
When reset is high, all state in the transmit port is cleared.
The SoP, EoP, and DatVal signals to the fabric switch are
held low. TxpN_ue_BufAvail_c1a is deasserted so that the
microengine believes that all packet buffers are full.


2:0 RxReset One bit per receive port. Bits 2:0 affect RX2,1,0.
When reset is high, all state in the receive port is
cleared. Dma_fsw_RdyN_s1a is asserted so that any
incoming fabric packets are accepted and dropped.
RxpN_ue_BufAvail_c1a is deasserted so that the
microengine believes that no packets have arrived.
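The per-port bit positions described above (bit 3+N for TX port N, bit N for RX port N) can be combined into a register value as in the sketch below. The function name `block_reset_value` is hypothetical, and the sketch only builds the value; it does not assume any particular way of writing it over the SCB.

```python
NUM_PORTS = 3  # RX0..RX2 and TX0..TX2


def block_reset_value(tx_ports=(), rx_ports=()):
    """Return an R_SDmaBlockReset value holding the given ports in reset."""
    value = 0
    for n in tx_ports:
        if not 0 <= n < NUM_PORTS:
            raise ValueError("bad TX port %d" % n)
        value |= 1 << (3 + n)      # TxReset: bit 3+N affects TX port N
    for n in rx_ports:
        if not 0 <= n < NUM_PORTS:
            raise ValueError("bad RX port %d" % n)
        value |= 1 << n            # RxReset: bits 2:0 affect RX2,1,0
    return value


# All six ports held in reset, then only TX port 1 held in reset.
print(hex(block_reset_value(tx_ports=range(3), rx_ports=range(3))))
print(hex(block_reset_value(tx_ports=[1])))
```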


5.7.0.2 ECC Mode Register 


Register 
R_SDmaEccMode 
Attributes 
-kernel 

Address 
0xE_0100_0004


31:7 Reserved


CifCorrEna Enable ECC correction in CIF. This logic is only needed
when the microengine does a BRD from a memory address
with bit 2 set (32-bit realignment).
Bug2396: When CifCorrEna is off and the microengine
does a BRD from a memory address with bit 2 set, the
ECC written into the DMA’s internal memory (TX or
COPY port packet buffer) is incorrectly forced to zero.
Data with corrupted ECC may reach the FSW or main
memory when the packet is sent. The safest workaround
is to always leave CifCorrEna on.


5 ImemCorrEna RW Enable ECC correction during Imem reads
DmemCorrEna Enable ECC correction during Dmem reads


CopyCorrEna RW Enable ECC correction when the Copy port reads a
memory and places data onto the Operand B bus


2:0 RxpCorrEna RW 7 Enable ECC correction when the RX port reads memory
and places data onto the Operand B bus


5.7.0.3 ALU Merge Operation Control Registers (added in Twice9) 


Register 
R_SDmaMergeOpHi[3:0] 
Attributes 

-kernel -noregtest 

Address 
0xE_0100_0020-0xE_0100_002C 


31:0 Hi TWC9A These four registers control the operation of the DMA
ALU operations Merge0, Merge1, Merge2, and Merge3.
R_SDmaMergeOpHi[N] controls bits 63:32 of the MergeN
result, while R_SDmaMergeOpLo[N] controls bits 31:0 of
the MergeN result. See 5.2.12.4 for details.


Register
R_SDmaMergeOpLo[3:0]
Attributes 

-kernel -noregtest 

Address 
0xE_0100_0030-0xE_0100_003C 


31:0 Lo RW TWC9A These four registers control the operation of the DMA
ALU operations Merge0, Merge1, Merge2, and Merge3.
R_SDmaMergeOpHi[N] controls bits 63:32 of the MergeN
result, while R_SDmaMergeOpLo[N] controls bits 31:0 of
the MergeN result. See 5.2.12.4 for details.


5.7.0.4 Force Error Register 


This register causes the circuit to intentionally produce specific errors. This will help us to test error detection 
logic and error handling software. 

Register 

R_SDmaForceErr 

Attributes 

-kernel 

Address 

0xE_0100_0008


Reserved

These bits are XORed with bits 1 and 0 of every word of
data being written to the data memory. If corrupted
data is read from Dmem, ECC correction logic in the
Dmem (if enabled) will detect the error and set a bit in
R_SDmaIntCause.

These bits are XORed with bits 1 and 0 of every word
of data being written to the instruction memory. If
corrupted data is read from Imem, ECC correction logic
in the Imem (if enabled) will detect the error and set a
bit in R_SDmaIntCause.


These bits are XORed with bits 1 and 0 of every word
of data being written to the copy port packet buffer and
read/write memory buffer. Corrupted data in the packet
buffer will be sent out the CSW to another block.
Corrupted data in the read/write memory buffer will be
corrected if the microengine reads it, but if it is written back
to CSW it will not be corrected by DMA at all.

These bits are XORed with bits 1 and 0 of every word of 
data being written to the packet buffer of every transmit 
port. This field allows software to intentionally corrupt 
the data that is sent out the TX port to the fabric switch, 
to test the ECC correction logic in the fabric switch. 
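The corruption step described for each of these fields is a plain XOR on the two low data bits. The sketch below shows only that step; the function name `corrupt_word` is hypothetical, and the ECC generation and checking around it are not modeled.

```python
def corrupt_word(word, flip_bits):
    """XOR a 2-bit force-error field into bits 1:0 of a data word."""
    return word ^ (flip_bits & 0b11)


original = 0xDEAD_BEEF_0000_00F0
print(hex(corrupt_word(original, 0b00)))  # field cleared: data unchanged
print(hex(corrupt_word(original, 0b01)))  # one data bit flipped
print(hex(corrupt_word(original, 0b11)))  # two data bits flipped
```

Setting one bit of a field flips one data bit per word and setting both flips two, which is presumably how the single-bit and double-bit error paths are each meant to be exercised.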


5.7.0.5 Microengine Status Registers 


Register 

R_SDmaUeStatus1 

Address 

0xE_0100_0108 

Definition 

3:0 PrevThread R 0 Which thread ran last (0-9)


Register 
R_SDmaUeSleepCondsL 
Address 
0xE_0100_0100 


31:0 SleepCondsL R X Lower 32 bits of the SleepCond vector in the microengine.

For each bit, 1 means that the condition is “available” or
“ready”. 0 means that any thread waiting for that
condition would continue to wait. The bit numbers of the
SleepCond vector are defined by the enum DmaUeSleepCond.
Example: Does thread 3 have a memory operation
outstanding in the DMA cache interface? The DmaUeSleepCond
table has a row called MEMDONE_THR3 whose
value is 0x2D. So you’d read SleepCondsH and
SleepCondsL, concatenate them into a 64-bit vector, and look
at bit number 0x2D. If that bit is zero, thread 3 has a
memory operation outstanding.
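The worked example above translates directly into code. This is a sketch that takes the two register values as plain integers; the function names are hypothetical, and how the registers are actually read over the SCB is outside its scope.

```python
MEMDONE_THR3 = 0x2D  # value quoted above from the DmaUeSleepCond enum


def sleep_cond_vector(sleep_conds_h, sleep_conds_l):
    """Concatenate SleepCondsH and SleepCondsL into one 64-bit vector."""
    return ((sleep_conds_h & 0xFFFFFFFF) << 32) | (sleep_conds_l & 0xFFFFFFFF)


def cond_ready(sleep_conds_h, sleep_conds_l, cond):
    """True if the condition bit is 1, meaning available or ready."""
    return bool((sleep_cond_vector(sleep_conds_h, sleep_conds_l) >> cond) & 1)


def thread3_mem_op_outstanding(sleep_conds_h, sleep_conds_l):
    """A zero MEMDONE_THR3 bit means thread 3 still has a memory op pending."""
    return not cond_ready(sleep_conds_h, sleep_conds_l, MEMDONE_THR3)


# Bit 0x2D of the 64-bit vector is bit 13 of SleepCondsH.
print(thread3_mem_op_outstanding(1 << 13, 0))
print(thread3_mem_op_outstanding(0, 0xFFFFFFFF))
```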


Register 
R_SDmaUeSleepCondsH 
Address 
0xE_0100_0104


31:0 SleepCondsH R Upper 32 bits of the SleepCond vector in the microengine.
See SleepCondsL for details.


5.7.0.6 Cache Interface Status Registers 


Register 
R_SDmaCifStatus1 
Address 
0xE_0100_0110 
31:24 Reserved


23:16 RefCntZero Reads the 8 RefCntZero signals that go from the cache
interface to the various ports. Use the DmaPort enum to
decide which bit represents which port, e.g. bit 8+DmaPort::TX0
represents cif_txp0_RefCntZero_cda.


15:12 WriteTidBusy R 0 A copy of the TidBusy wires for the 4 DMA write TIDs
11:8 ReadTidBusy R 0 A copy of the TidBusy wires for the 4 DMA read TIDs


7:4 OwtValid R Valid bits of the outstanding write table. If bit 4+X is
set, the DMA has an outstanding write on DMA write tid
X.
3:0 OrtValid R Valid bits of the outstanding read table. If bit X is set,
the DMA has an outstanding read on DMA read tid X.


Register 


R_SDmaCifStatus2 
Address 


0xE_0100_0114


Definition 


31:16 | OwtThread | R X Four fields of four bits each. Bits (19+4*X to 16+4*X) 
are the thread number of the Outstanding Write Table 
entry X. 


15:0 OrtThread R X Four fields of four bits each. Bits (3+4*X to 4*X) are the
thread number of the Outstanding Read Table entry X.
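The two cache interface status registers are meant to be read together: the valid bits in R_SDmaCifStatus1 say which table entries are live, and the 4-bit fields in R_SDmaCifStatus2 say which thread owns each one. A minimal sketch of that pairing follows; the function name `outstanding_ops` and the example values are hypothetical.

```python
def outstanding_ops(status1, status2):
    """Map each valid ORT/OWT entry to the thread that owns it.

    status1 -- value of R_SDmaCifStatus1 (OrtValid in 3:0, OwtValid in 7:4)
    status2 -- value of R_SDmaCifStatus2 (OrtThread in 15:0, OwtThread in 31:16)
    """
    reads, writes = {}, {}
    for x in range(4):
        if (status1 >> x) & 1:                    # OrtValid bit X
            reads[x] = (status2 >> (4 * x)) & 0xF
        if (status1 >> (4 + x)) & 1:              # OwtValid bit 4+X
            writes[x] = (status2 >> (16 + 4 * x)) & 0xF
    return {"reads": reads, "writes": writes}


# Hypothetical snapshot: read tids 0 and 2 busy (threads 3 and 7),
# write tid 1 busy (thread 9).
print(outstanding_ops(0x25, (9 << 20) | 0x0703))
```

Entries whose valid bit is clear are skipped, since their thread fields may hold stale values.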


5.7.0.7 Rx/Tx Port Status Registers 


There are three port status registers, one for each pair of RX and TX ports. R_SDmaPortStatus[N] gives the status of
RX port N and TX port N.


Register 
R_SDmaPortStatus[2:0] 
Address 
0xE_0100_0120-0xE_0100_0128


31:26 Reserved


25:24 TxWhichBuf In the transmit port, which packet buffer is the
microengine working on?


23:16 TxBufState Read the packet buffer state. This field contains four bit
fields of 2 bits each. Bits (17+2*M to 16+2*M) give
the state of packet buffer M. The 2-bit fields are of type
DmaTxpState.


15:10 Reserved


9:8 RxWhichBuf In the receive port, which packet buffer is the microengine
working on?


7:0 RxBufState Read the packet buffer state. This field contains four bit
fields of 2 bits each. Bits (1+2*M to 2*M) give the state
of packet buffer M. The 2-bit fields are of type
DmaRxpState.
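The per-buffer state fields can be pulled out of a port status value with the index formulas given above. The sketch below returns the raw 2-bit codes; their meaning comes from the DmaTxpState and DmaRxpState enums, which are not reproduced here, and the function names are hypothetical.

```python
def tx_buf_state(port_status, m):
    """Raw 2-bit state of TX packet buffer M: bits 17+2*M down to 16+2*M."""
    if not 0 <= m < 4:
        raise ValueError("packet buffer index must be 0..3")
    return (port_status >> (16 + 2 * m)) & 0x3


def rx_buf_state(port_status, m):
    """Raw 2-bit state of RX packet buffer M: bits 1+2*M down to 2*M."""
    if not 0 <= m < 4:
        raise ValueError("packet buffer index must be 0..3")
    return (port_status >> (2 * m)) & 0x3


# Hypothetical snapshot: TX buffer 1 in state 2, RX buffer 2 in state 1.
snapshot = (0b10 << 18) | (0b01 << 4)
print([tx_buf_state(snapshot, m) for m in range(4)])
print([rx_buf_state(snapshot, m) for m in range(4)])
```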


5.7.0.8 Copy Port Status Register 


Register 

R_SDmaCopyPortStatus 

Address 

0xE_0100_0130

31:16 Reserved


15:14 CopyTxWhichBuf R In the copy port, which packet buffer is the
DMA_THR_COPY_TX thread of the microengine
working on?


13:12 CopyRxWhichBuf R In the copy port, which packet buffer is the
DMA_THR_COPY_RX thread of the microengine
working on?

11:0 CopyBufState R X Read the packet buffer state. This field contains four bit
fields of 2 bits each. Bits 3*M gives the state of packet 
buffer M. The 2-bit fields are of type DmaRxpState. 


5.7.0.9 Interrupt Cause Register 


The interrupt cause register contains flags which are set when an event occurs, and cleared by software by 
writing a 1 to that bit. 


Note on ECC correction and interrupt bits: Assuming correction is enabled, if a single bit error is detected, the 
data is corrected and the Sbe interrupt cause bit is set. If a double bit error is detected, both the Dbe interrupt 
cause bit and the Sbe interrupt cause bit are set, and the bad data will not be modified. 


Register 
R_SDmaIntCause
Address 
0xE_0100_0200 
Attributes 


-kernel


31 Intr R This bit is 1 when any bit in the
expression R_SDmaIntCause[30:0] &
R_SDmaIntMask[30:0] is set. It becomes
the primary output dma_xxx_Int_ca.


30:15 Reserved.


14 CifDbe RW1C Cache Interface Double Bit Error. A double
bit error has been detected in data read from
the CSW. ECC correction/detection only occurs in the
CIF if a block read is performed with address
bit 2 equal to 1. If address bit 2 is 0, the CIF
does not check ECC at all, and the data goes
straight to the TX or copy port.


13 ImemDbe Imem Double Bit Error. A double bit error
has been detected in data read from the
Instruction Memory.


12 DmemDbe Dmem Double Bit Error. A double bit error
has been detected in data read from the Data
Memory.


11 CopyDbe Copy Port Double Bit Error. A double bit
error has been detected in data read from the
packet buffer or read/write memory buffer in
the copy port.
ECC correction/detection occurs if the
microengine reads a corrupted data word in the
packet buffer or read/write memory buffer.
But if the corrupted word is written straight
back to the CSW, no correction/detection
occurs in DMA.


10:8 RxpDbe RW1C Receive Port Double Bit Error. Bit 8+N
describes errors from RX port N. A double bit
error has been detected in data read from the
receive port packet buffer or the receive port
operand memory.
ECC correction/detection occurs if the
microengine reads a corrupted data word that came
from the fabric switch. But if the packet is
written straight to memory with a BRD, no
correction/detection occurs in DMA.


CifSbe RW1C Cache Interface Single Bit Error. A single bit
error has been corrected in data coming from
the CSW. See note in CifDbe description for
when ECC correction occurs.


ImemSbe Imem Single Bit Error. A single bit error has
been corrected in data read from the
Instruction Memory.


DmemSbe Dmem Single Bit Error. A single bit error
has been corrected in data read from the Data
Memory.


CopySbe Copy Port Single Bit Error. A single bit
error has been corrected in data read from the
copy port packet buffer or read/write memory
buffer. See note in CopyDbe description for
when ECC correction occurs.


2:0 RxpSbe RW1C 0 Receive Port Single Bit Error. A single bit
error has been corrected in data read from the
receive port packet buffer or operand regfile.
Bit 0+N describes errors from RX port N. See
note in RxpDbe description for when ECC
correction occurs.


5.7.0.10 Interrupt Mask Register


An interrupt mask register allows software to control which kinds of interrupts will cause the DMA’s slow
interrupt line to be asserted. Suppose that only double bit errors are of interest; software would write ones
in R_SDmaIntMask for the bits corresponding to the double bit error interrupt causes in R_SDmaIntCause. Then,
if any double bit error occurs, R_SDmaIntCause bit 31 would be set and the slow interrupt line would be asserted.
If any other kind of error occurs, its R_SDmaIntCause bit would still be set, but bit 31 and the slow interrupt line
would not be affected.
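The relationship between the cause bits, the mask, and bit 31 can be modeled in a few lines. This is a behavioral sketch, not the hardware implementation: the function names are hypothetical, and DBE_MASK assumes the double bit error flags occupy R_SDmaIntCause bits 14:8, which is how the cause register table reads.

```python
CAUSE_BITS = 0x7FFFFFFF   # R_SDmaIntCause[30:0] and R_SDmaIntMask[30:0]
DBE_MASK = 0x00007F00     # assumed: double bit error flags in bits 14:8


def intr_bit(int_cause, int_mask):
    """Value of R_SDmaIntCause bit 31: 1 if any enabled cause bit is set."""
    return int((int_cause & int_mask & CAUSE_BITS) != 0)


def write_one_to_clear(int_cause, written):
    """Cause bits remaining after software writes `written` to the register."""
    return int_cause & ~written & CAUSE_BITS


# Only double bit errors are enabled. A single bit error sets its cause
# bit but does not raise the interrupt; a double bit error does.
print(intr_bit(1 << 0, DBE_MASK))
print(intr_bit(1 << 14, DBE_MASK))
print(hex(write_one_to_clear((1 << 14) | 1, 1 << 14)))
```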

Register 

R_SDmaIntMask

Address 


0xE_0100_0204
Attributes 
-kernel 


31 Reserved
30:0 | IntMask RW If the corresponding interrupt cause bit is ever set, assert 
the interrupt. 


5.8 SCB Performance Events 


The following events are trackable by SCB statistical event counting. 


Enum 


DmaScbEvent 


Attributes 


-descfunc 


8’h09 READ_MISS cif_csr_ReadMiss_ca: Block reads that missed
in L2 cache


8’h0A READ_HIT cif_csr_ReadHit_ca: Block reads that hit in L2
cache


8’h0B WRITE_MISS cif_csr_WriteMiss_ca: Block writes that
missed in L2 cache

8’h0C WRITE_HIT cif_csr_WriteHit_ca: Block writes that hit in
L2 cache


8’h0D-8’h1F Reserved
8’h20 COPY_MEMIN_PBUF cif_copy_MemInPbufSel_c4a: Cache blocks
copied from memory to copy port


8’h21 COPY_MEMIN_RWMB cif_copy_MemInRmbSel_c4a: Cache blocks
copied from memory to r/w mem buffer
8’h22 COPY_MEMOUT_PBUF cif_copy_MemOutPbufSel_c2a: Cache blocks
8’h23 COPY_MEMOUT_RWMB cif_copy_MemOutWmbSel_c2a: Cache blocks
copied from copy port to memory


8’h24 TXP0_MEMIN cif_txp_MemInTxp0Sel_c4a: Cache blocks
copied from memory into TX port 0

8’h25 TXP1_MEMIN cif_txp_MemInTxp1Sel_c4a: Cache blocks
copied from memory into TX port 1

8’h26 TXP2_MEMIN cif_txp_MemInTxp2Sel_c4a: Cache blocks
copied from memory into TX port 2

8’h27 RXP0_MEMIN cif_rxp_MemOutRxp0Sel_c2a: Cache blocks
copied from RX port 0 to memory


8’h28 RXP1_MEMIN cif_rxp_MemOutRxp1Sel_c2a: Cache blocks
copied from RX port 1 to memory


8’h29 RXP2_MEMIN cif_rxp_MemOutRxp2Sel_c2a: Cache blocks
copied from RX port 2 to memory
8’h2A-8’h3F Reserved


8’h40 UE_INSTR_VALID ue_xxx_DbgValid_c2a: Instructions executed
in microengine

8’h41 START_IO cif_ue_StartIo_c1a: I/O reads, writes, and SPCLs
received by DMA.


8’h42 TASK_START ue_cif_TaskStart_cda: CSW operations
started by the microengine.


8’h43 COPY_PORT_PKTS ue_copy_TxThreadDone_c5a: Packets transferred
out of the copy port.


8’h44-FF Reserved.


5.9 Internal Data Formats and States 


The data formats for some internal buses are documented here in the spec to help the SystemC and Verilog 
models stay in sync with each other. The only people who would care about these formats are the SystemC and 
Verilog authors. Everyone else can safely ignore this section. 


5.9.1 Encoding of Buses between DmaCsr and DmaUe 
5.9.1.1 CsrUeStat - For csr_ue_Stat_ca bus 


Class 
CsrUeStat 


d1[63:0] U1 Unused. Drive 0.
d0[63:3] U0 Unused. Drive 0.


d0[2] EnableEcc Enable ECC correction on Imem
d0[1:0] FlipMemBits XOR these bits with Imem data before writing.


5.9.1.2 UeCsrStat - For csr_ue_Stat_ca bus 


Class 
UeCsrStat


d1[63:32] SleepConds Connect to m_SleepCond_c2a[63:32]
d1[31:0] SleepConds Connect to m_SleepCond_c2a[31:0]
d0[63:6] U0 Unused. Drive 0.

d0[5:2] PrevThread Which thread ran last (0-9)
d0[1] DoubleBitErr ECC corrector detected a double bit ECC error while
reading instruction memory.
d0[0] SingleBitErr ECC corrector detected a single bit ECC error while
reading instruction memory.


5.9.2 Encoding of Buses between DmaCsr and DmaCif


5.9.2.1 CsrCifStat - For csr_cif_Stat_ca bus 


Class 

CsrCifStat 

d1[63:0] U1 Unused. Drive 0.

d0[63:3] U0 Unused. Drive 0.
d0[2] EnableEcc Enable ECC correction


d0[1:0] FlipMemBits XOR these bits with the output of the ECC generator for
cif_xxx_MemOutDwl_c4a during 32-bit realignment.


5.9.2.2 CifCsrStat - For csr_cif_Stat_ca bus 


Class 

CifCsrStat 

Unused. Drive 0.

arbitration counter for data queue selection

Unused. Drive 0.

d1[9:8] arbitration counter for command queue selection
d1[7:4] DataSelQueue DmaCifQueueNum Which data queue was selected to go onto the dma_csw
data bus?
d1[3:0] CmdSelQueue DmaCifQueueNum Which command queue was selected to go onto the
dma_csw command bus?


d0[25] DoubleBitErr ECC corrector detected a double bit ECC error during
32-bit realignment of data from the CSW
d0[24] SingleBitErr ECC corrector detected a single bit ECC error during
32-bit realignment of data from the CSW
d0[23:16] RefCntZero Provide RefCntZero in R_SDmaCifStatus2
d0[15:12] WriteTidBusy Provide WriteTidBusy in R_SDmaCifStatus1
d0[11:8] ReadTidBusy Provide ReadTidBusy in R_SDmaCifStatus1
d0[7:4] OwtValid Connect to OWT valid bits
d0[3:0] OrtValid Connect to ORT valid bits


5.9.3 Encoding of Buses between DmaCsr and DmaDmem 
5.9.3.1 CsrDmemStat - For csr_dmem_Stat_ca bus 

Class 

CsrDmemStat 
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d1[63:0] a Unused. Drive 0. 
i 7 a ia Unused. Drive 0. 


EnableEcc | | Enable ECC correction
FlipMemBits | | XOR these bits with Dmem data before writing.


5.9.3.2 DmemCsrStat - For csr_dmem_Stat_ca bus 


Class 
DmemCsrStat 


d1[63:0] | U1 | | Unused. Drive 0.
d0[63:2] | U0 | | Unused. Drive 0.


(1] DoubleBitErr ECC corrector detected a double bit ECC error while 
fl scan IB =" ahead 
dol SingleBitErr ECC corrector aeecica a single bit ECC error while read- 
eee LL Hingisteinnoge | 


5.9.4 Encoding of Buses between DmaCsr and DmaTxp 


5.9.4.1 CsrTxpStat - For csr_txp_Stat_ca bus 


Class 
CsrTxpStat 


Definition 
d1[63:0] | U1 | | Unused. Drive 0.
d0[63:2] | U0 | | Unused. Drive 0.
d0[1:0] | FlipMemBits | | XOR these bits with ALU result data before writing to packet buffer or operand register file.


5.9.4.2 TxpCsrStat - For csr_txp_Stat_ca bus 


Class 
TxpCsrStat 


d1[63:0] | U1 | | Unused. Drive 0.
d0[63:10] | U0 | | Unused. Drive 0.
d0[9:8] | TxWhichBuf | | Provide TxWhichBuf in R_SDmaPortStatus[X]
d0[7:0] | TxBufState | | Provide TxBufState in R_SDmaPortStatus[X]
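
As a reading aid, the d0 layout of TxpCsrStat can be unpacked in software roughly as follows. This is only a sketch: the function name `decode_txp_csr_stat` and the dictionary keys are illustrative assumptions, and only the bit positions are taken from the table above.

```python
def decode_txp_csr_stat(d0: int) -> dict:
    """Split the d0 word of TxpCsrStat into its documented fields.

    Hypothetical helper for illustration only; the bit positions follow
    the TxpCsrStat table, everything else is an assumption.
    """
    return {
        "TxWhichBuf": (d0 >> 8) & 0x3,  # d0[9:8]
        "TxBufState": d0 & 0xFF,        # d0[7:0]
        "unused": d0 >> 10,             # d0[63:10], expected to read as 0
    }
```

A non-zero `unused` value would indicate that the driver of the bus is not honoring the "Drive 0" requirement.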


5.9.5 Encoding of Buses between DmaCsr and DmaRxp 
5.9.5.1 CsrRxpStat - For csr_rxp_Stat_ca bus 

Class 

CsrRxpStat 


d1[63:0] | U1 | | Unused. Drive 0.
d0[63:1] | U0 | | Unused. Drive 0.
d0[0] | EnableEcc | | Enable ECC correction


5.9.5.2 RxpCsrStat - For csr_rxp_Stat_ca bus 


Class 
RxpCsrStat 


May 14, 2014 a7 Rev 51328 


SiCortex Confidential CHAPTER 5. DMA ENGINE 


d1[63:0] | U1 | | Unused. Drive 0.
d0[63:12] | U0 | | Unused. Drive 0.
d0[11] | DoubleBitErr | | ECC corrector detected a double bit ECC error while reading the packet buffer or operand memory.
d0[10] | SingleBitErr | | ECC corrector detected a single bit ECC error while reading the packet buffer or operand memory.
d0[9:8] | RxWhichBuf | | Provide RxWhichBuf in R_SDmaPortStatus[X]
d0[7:0] | RxBufState | | Provide RxBufState in R_SDmaPortStatus[X]


5.9.6 Encoding of Buses between DmaCsr and DmaCopy
5.9.6.1 CsrCopyStat - For csr_copy_Stat_ca bus


Class
CsrCopyStat


Definition


d1[63:0] | U1 | | Unused. Drive 0.
d0[63:3] | U0 | | Unused. Drive 0.
d0[2] | EnableEcc | | Enable ECC correction
d0[1:0] | FlipMemBits | | XOR these bits with ALU result data before writing to packet buffer or read/write memory buffer.


5.9.6.2 CopyCsrStat - For csr_copy_Stat_ca bus


Class
CopyCsrStat
Definition


d1[63:0] | U1 | | Unused. Drive 0.
d0[63:18] | U0 | | Unused. Drive 0.
d0[17] | DoubleBitErr | | ECC corrector detected a double bit ECC error while reading the packet buffer or read/write memory buffer.
d0[16] | SingleBitErr | | ECC corrector detected a single bit ECC error while reading the packet buffer or read/write memory buffer.
d0[15:14] | CopyTxWhichBuf | | Provide CopyTxWhichBuf in R_SDmaCopyPortStatus
d0[13:12] | CopyRxWhichBuf | | Provide CopyRxWhichBuf in R_SDmaCopyPortStatus
d0[11:0] | CopyBufState | | Provide CopyBufState in R_SDmaCopyPortStatus
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Chapter 6 


Processor Segments 


[$Id: processor.lyx 47578 2007-11-16 21:54:43Z wsnyder $]


6.1 Overview 


The SCX1000 includes six identical processors implementing the MIPS64 Architecture including floating point. 
Each CPU is a MIPS 5kf with custom extensions. (MIPS may rename our re-derived CPU, but for now, we’ll con- 
tinue to call it 5kf.) The processor segment contains one CPU, its associated 256KB L2 cache segment, maintenance 
and control registers, and the processor interrupt controller. 


6.2 Specifications 


Each processor has the following major features, with features we’ve changed or configured from the base MIPS 
5kf indicated in bold: 


• 64-bit Data and address path
• 42-bit Virtual and 36-bit physical address space
• MIPS64 Compatible Instruction Set
  - Multiply-Accumulate and Multiply-Subtract (MADD, MADDU, MSUB, MSUBU)
  - Zero/One Detect (CLZ, CLO, DCLO, DCLZ)
  - Conditional Move Instructions (MOVZ, MOVN)
  - Prefetch Instructions (PREF, PREFX), including L2 prefetches

• Dual issue super-scalar architecture, capable of simultaneously executing:
  - 1 integer and 1 arithmetic floating point
  - 1 floating point arithmetic and 1 floating point store

• Floating Point
  - IEEE 754 compatible
  - Single and double precision
  - Multiply and add instruction
  - Issue one multiply add double every clock
  - Fast flush-to-zero mode to optimize performance

• Multiply/Divide Unit
  - Issue one 32x16 multiply every clock
  - Issue one 32x32 multiply every other clock
  - Issue one 64x64 multiply every nine clocks
  - 37 clock latency on 32/32 divide
  - 69 clock latency on 64/64 divide
  - Early-in feature returns division results sooner for smaller dividends

• Memory Management Unit
  - 48 dual-entry JTLB
  - 4-entry instruction micro TLB
  - 4-entry data micro TLB
  - 16 KB to 16 MB page sizes. (Note 4KB pages are not supported.)
  - 8 bit ASID.

• Caches
  - 32 KB 4-Way Data cache
  - 32 KB 4-Way Instruction cache
  - Write-back and write-allocate
  - Non-blocking loads
  - 32-byte cache line size
  - Virtually indexed, physically tagged
  - Support for locking cache lines
  - Non-blocking prefetches
  - ECC protected Data Cache, parity protected I Cache

• Bus Interface Unit
  - Separate 32-bit address request bus and 64-bit data bus
  - Four 64-bit IO write buffers
  - One 32-byte eviction buffer
  - Load Linked, Store Conditional multi-processor support
  - SYNC instruction support

• Independent intervention (probe) bus
  - Probing of D-Cache, Write Buffers

• Performance Monitoring logic


6.3 User Code Visible Bugs and Enhancements


6.3.1 Product and Chip Pass Differences 


1. ICE9B returns a different product (ICE9B) when reading R_CpuPRId and R_CpuTapIDCODE.

2. ICE9B fixes bug1965 whereby R_CpuErrCtl reads swap bits 31 and 28. In ICE9A any read-modify-writes
need to swap these bits before writing them back.

3. ICE9B improves micro DTLB performance (bug2200) with an entry size of 64KB when the corresponding
TLB entry is 64KB or larger. If the TLB entry is 16KB, the old 4KB uTLB entry size is used.

4. ICE9B improves probe performance by using 64 byte probes, see bug2202.

5. ICE9B removes an unnecessary synchronizer on the cac_cpu_int wires; this reduces interrupt latency by one
pclk.

6. ICE9B adds performance counter events for L2 misses and floating point operations, and allows all events
to be visible to both counter 0 and counter 1.

7. TWC9A returns a different product (TWC9A) when reading R_CpuPRId and R_CpuTapIDCODE.

8. TWC9A uses a new core, IceT. This is described in a different document.
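
The ICE9A workaround for bug1965 (item 2) can be sketched as below. This is an illustration under stated assumptions, not the kernel's actual code: `read_reg` and `write_reg` are hypothetical callbacks standing in for whatever register accessors software uses, and the register is assumed to be 32 bits wide.

```python
def swap_bits_31_28(value: int) -> int:
    """Exchange bits 31 and 28 of a 32-bit value, leaving other bits alone."""
    b31 = (value >> 31) & 1
    b28 = (value >> 28) & 1
    if b31 != b28:
        # Flipping both bits swaps them when they differ.
        value ^= (1 << 31) | (1 << 28)
    return value & 0xFFFFFFFF


def errctl_read_modify_write(read_reg, write_reg, set_mask=0, clear_mask=0,
                             is_ice9a=True):
    """Read-modify-write of R_CpuErrCtl with the ICE9A bit-swap workaround.

    read_reg/write_reg are hypothetical accessors (assumptions). On ICE9A
    the value read has bits 31 and 28 swapped, so they are swapped back
    before the modified value is written.
    """
    raw = read_reg()
    value = swap_bits_31_28(raw) if is_ice9a else raw
    value = (value | set_mask) & ~clear_mask & 0xFFFFFFFF
    write_reg(value)
    return value
```

On ICE9B and later, the same routine would be called with `is_ice9a=False`, since the read path no longer swaps the bits.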


6.3.2 Known Bugs and Possible Enhancements (M5KF only)


1. On D-Cache ECC errors, R_CpuCacheErr_EW may record the incorrect way number and index, see
bug1575. As a workaround, software should flush the entire cache on ECC errors.

2. On filling the TLB with a 4KB page, we should pull a machine check, as 4KB pages are not supported.

3. On writes to accelerated space, we should pull a machine check, as they are not supported.

4. We should add a 64-bit cycle counter which is NOT writable, as the current count register is occasionally
overwritten by the kernel, bug3342.

5. We should implement the RDHWR instruction so user space code can see the cycle counter and processor
number.

6. We should add more VA bits, to enable the VA to be unique across the entire system.


6.4 Kernel and Performance Bugs and Enhancements 


6.4.1 Product and Chip Pass Differences 


1. ICE9B returns a different product (ICE9B) when reading R_CpuPRId and R_CpuTapIDCODE.

2. ICE9B fixes bug1965 whereby R_CpuErrCtl reads swap bits 31 and 28. In ICE9A any read-modify-writes
need to swap these bits before writing them back.

3. ICE9B improves micro DTLB performance (bug2200) with an entry size of 64KB when the corresponding
TLB entry is 64KB or larger. If the TLB entry is 16KB, the old 4KB uTLB entry size is used.

4. ICE9B improves probe performance by using 64 byte probes, see bug2202.

5. ICE9B removes an unnecessary synchronizer on the cac_cpu_int wires; this reduces interrupt latency by one
pclk.

6. ICE9B adds performance counter events for L2 misses and floating point operations, and allows all events
to be visible to both counter 0 and counter 1.

7. TWC9A returns a different product (TWC9A) when reading R_CpuPRId and R_CpuTapIDCODE.

8. TWC9A uses a new core, IceT. This is described in a different document.
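
The micro DTLB sizing rule in item 3 matters for performance tuning, so a small sketch may help. It assumes that ICE9A always uses the 4KB micro-TLB entry size (implied by "the old 4KB uTLB entry size", but not stated outright); the function names are illustrative only.

```python
KB = 1024


def micro_dtlb_entry_size(jtlb_page_size: int, chip: str = "ICE9B") -> int:
    """Return the micro DTLB entry size in bytes for a given JTLB page size.

    Assumption: ICE9A always uses 4KB micro-TLB entries; ICE9B uses 64KB
    entries when the backing JTLB entry is 64KB or larger.
    """
    if chip == "ICE9B" and jtlb_page_size >= 64 * KB:
        return 64 * KB
    return 4 * KB


def micro_dtlb_entries_to_cover(nbytes: int, jtlb_page_size: int,
                                chip: str = "ICE9B") -> int:
    """Number of micro DTLB entries touched when walking nbytes linearly."""
    size = micro_dtlb_entry_size(jtlb_page_size, chip)
    return -(-nbytes // size)  # ceiling division
```

Under these assumptions, walking 256KB of data touches 4 micro DTLB entries with 64KB pages on ICE9B, but 64 entries with 16KB pages, which is why larger pages are attractive with only a 4-entry data micro TLB.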


6.4.2 Known Bugs and Possible Enhancements (M5KF only) 


1. On D-Cache ECC errors, R_CpuCacheErr_EW may record the incorrect way number and index, see
bug1575. As a workaround, software should flush the entire cache on ECC errors.

2. On filling the TLB with a 4KB page, we should pull a machine check, as 4KB pages are not supported.

3. On writes to accelerated space, we should pull a machine check, as they are not supported.

4. We should add a 64-bit cycle counter which is NOT writable, as the current count register is occasionally
overwritten by the kernel, bug3342.

5. We should implement the RDHWR instruction so user space code can see the cycle counter and processor
number.

6. We should add more VA bits, to enable the VA to be unique across the entire system.


6.5 Complete Documentation 


For complete information on the MIPS 5kf core, see the documentation provided by MIPS. The remainder of 
this chapter will discuss only the bus interface and items being changed inside the CPU. 


(Tech Pubs: Remove this and insert the relevant 5KF documentation.) 


6.6 BIU Description 


The CPU bus interface connects the CPU with the associated L2 cache. The BIU interface is based upon the 
default 5kf interface, with some extensions as described below. 


6.6.1 BIU Ports 


Signals corresponding to original MIPS 5kf BIU signals are listed below. The capitalized middle part of the
signal always corresponds to the original MIPS signal name with EB_ prepended; for example,
cpu_cac_reqAValid_pr corresponds to EB_AValid.


Name 


cac_cpu_reqARdy_pr
cac_cpu_reqWDRdy_pr
cpu_cac_reqAValid_pr


cpu_cac_reqAddr_pr[35:3]
cpu_cac_reqBE_pr[7:0]


cpu_cac_reqBLen_pr[1:0]


cpu_cac_ ee 


cac_cpu_int_p[3:0] 


Description 

Cache ready for new address, CPU may send _reqAValid in the next 
Cache ready for new write data, CPU may send write data in the ne 
Address bus and access type are valid this cycle. 

Read/write transaction address. 

IO transaction byte enables. 

Burst transaction; reqBFirst, reqBLast and reqBLen indicate the sta 
First cycle of multiple-cycle burst. May not be needed, as can be det 
Last cycle of multiple-cycle burst. May not be needed, as can be det: 
Number of cycles in burst. Not valid for non-bursts. 

Read is for an instruction fetch. Data will go to the I-Cache and so t 
Write data. 

Write, not a read. 

Read return data is valid this cycle. 

Read return is in error. (unused, tied false) 

Read return data. 

CPU is waiting for write buffers to empty. This may be used to re-p! 
Six bit interrupt request mask. Top two bits are tied to 0. 


The following signals have been added to the base design: 
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inJOut 


cpu_cac_reqCmd_pr[2:0] | Requested command. Valid when cpu_cac_reqVld_pr is asserted. See 6.26.1.

cpu_cac_reqRId_pr | Requested read identifier. For reads or prefetches, this indicates which CPU read-id
needs to be indicated with the eventual return and retirement.

cac_cpu_rtnPMHit_pr | Read return hit in L2 Cache. Valid when cac_cpu_rtnRdVal_pr asserted for cachable
addresses.

cac_cpu_rtnPMState_pr[2:0] | Read return CacState. Valid when cac_cpu_rtnRdVal_pr asserted with
cac_cpu_rtnPMHit_pr.

cac_cpu_rtnPMStop_pr[3:0] | Read return bus stop number. CswStopNum for memory (non IO) read data,
valid when cac_cpu_rtnRdVal_pr asserted.

cac_cpu_rtnRId_pr[2:0] | TWC9A+ | Read return identifier. When cac_cpu_rtnRdVal_pr asserts, indicates
which read return the data is for. This is the identifier requested with cpu_cac_reqRId_pr.

cac_cpu_rbDone_pr[7:0] | TWC9A+ | Read buffer completion. When a bit pulses for one cycle, the
corresponding cpu_cac_reqRId_pr number may now be retired and reused. If it’s reused, this same number
may appear on cpu_cac_reqRId_pr as soon as the cycle after next. This handshake is independent of
cac_cpu_rtnRdVal_pr, as it has the flexibility to hold a buffer until a TID is done, and allows multiple
TIDs to retire at once.

Sync Holdoff. Asserted to indicate sync instructions must be held off. Must first assert two cycles after
cpu_cac_reqAValid_pr & cac_cpu_reqARdy_pr are asserted, and cleared when sync instructions may complete.

Pulsed to indicate an IO write buffer has been emptied on the L2 side, and a credit should be added to
the buffer count.

Probe address. Note wrapping request on [4:3] is only a hint, and cannot be guaranteed to be the order
returned by the CPU. In fact, it is always 0. In pass2, probes are 64 bytes, and bit[5] is ignored.

Intervention acknowledge, invDirty indicates hit

Intervention hit on cache or write buffer. L2 must not require this signal for correct protocol; it is for
statistical, verification, and debugging use. In ICE9A, this is a single bit signal; in ICE9B+ it
indicates the status of each 32B half of the 64B

cpu_cac_invDirty_pr[1:0] | Intervention hit on dirty cache or write buffer. In ICE9A, this is a single
bit signal; in ICE9B+ it indicates the status of each 32B half of the 64B
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L1 Miss Read Return 
[Timing diagram not reproduced. Signals shown: cpu_cac_dcMiss_pr, cpu_cac_dcAddr_pr[35:3],
cpu_cac_reqAValid_pr, cpu_cac_reqWrite_pr, cpu_cac_reqAddr_pr[35:3], cpu_cac_reqBE_pr[7:0],
cpu_cac_reqBFirst_pr, cpu_cac_reqBLast_pr, cac_cpu_rtnRdVal_pr, cac_cpu_rtnRData_pr[63:0], and
internal CPU load and read-buffer signals.]


Figure 6.1: BIU Read Transaction Timing 


6.6.2 D-Cache Reads


D-Cache transactions begin with a load instruction in the R stage of the pipe. The address is determined to 
miss in the L1 D-Cache, and the speculative miss dcMiss and dcAddr signals are asserted. The transaction is sent 
to the BIU. If there was dirty L1 data to be evicted, it is extracted and added to the write buffer, and becomes a 
write transaction described below. 

The BIU issues the read request to the L2 by asserting reqAValid_pr with a burst length of 4 (there are four 
64-bit chunks in the 32B cache line.) When the L2 completes the request, the L2 places the four data bursts on 
rtnRData_pr, and asserts rtnRdVal_pr with the read identifier on rtnRId_pr. The return order of data must match 
that requested. When the TID is completed, the L2 asserts rbAck_pr with the read identifier on rbRId_pr. 

If the processor attempts a DCache read to a block in the SHARED state, the L2 lookup will result in a MISS. 
This will cause the SHARED block matching the target address to be “victimized” (that is, replaced in the L2) and 
a RDEX to be issued to the CSW to fill the block from main memory. 


6.6.3 I-Cache Reads


Instruction cache reads look the same to the L2 cache as data stream reads. The CPU indicates the read is for 
I-Stream by asserting reqInstr_pr along with the address. The L2 may use this to fill the L2 cache in shared state. 
Since interventions do not probe the I-Cache, instruction lines may be in multiple CPU I-Caches simultaneously. 


Istream accesses to L2 cache blocks in EXCL, DIRTY, or UPDATED states will result in an L2 cache hit. 
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Write Buffer Return 
[Timing diagram not reproduced. Signals shown: pclk, cpu_cac_reqAValid_pr, cpu_cac_reqWrite_pr,
cpu_cac_reqAddr_pr[35:3], cpu_cac_reqWData_pr[63:0], cpu_cac_reqBE_pr[7:0], cpu_cac_reqBFirst_pr,
cpu_cac_reqBLast_pr, cac_cpu_wbIoAck_pr, cpu_cac_syncBusy_pr.]


Figure 6.2: BIU Write Transaction Timing 


6.6.4 Istream Initial Reads 


The L2 cache supports I-Stream accesses while the L1 cache is disabled. This allows booting of the processor,
and cache trap handlers which enter non-cachable mode.


6.6.5 Evictions 


L1 evictions are handled by the standard MIPS interface. When a cache fill is required, the LRU line from 
the cache is read out and stored into the BIU write buffer. After the BIU places the read request on the bus, the 
eviction is requested, and the write data transferred. The L2 must assert syncBusy_pr one cycle after the write is 
received, and keep it asserted until the write is coherent, see 6.6.9. 

To prevent deadlock, the L2 cache must accept any number of evictions while a probe is outstanding. Evictions 
should thus always be able to be written back to the L2, and should never require Coh action (and thus potential 
deadlock.) 


6.6.6 IO Writes 


IO Writes are handled by the standard MIPS interface. IO Writes are distinguished by address bit [35] being 
set. The BIU places the write on the bus. The L2 must assert syncBusy_pr one cycle after the write is received, 
and keep it asserted until the write is coherent, see 6.6.9. The ICE9 chip does NOT support “accelerated uncached 
write bursts” from the MIPS core. The L2/CSW supports only one active IO write at a time, so IO writes are 
enqueued in the interface between the MIPS core and the CSW. (See Section 6.18.) 


6.6.6.1 IO Write Buffer Counter 


To prevent overrunning the write buffer in the L2 cache, the BIU keeps track of the number of L2 IO write
buffer entries that may be in use. The count starts at 5 entries, the size of the CPU and L2 write buffer. As IO
write buffer entries are allocated, the count is decremented, where an IO write is defined as a write with address bit
[35] set. When an IO write reaches the L2 coherency point, the L2 asserts wbAck_pr, which increments the count.




If the write buffer count minus the number of load/stores in flight is less than 2, on the next load/store the 
instruction pipeline stalls until a buffer is freed. (The extra buffer is due to pipeline delays in decrementing versus 
checking the count, covering the case of when there are back-to-back stores.) 
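
The credit scheme above can be modeled in a few lines. This is a behavioral sketch only, not the RTL: the class and method names are assumptions made for illustration, while the initial count of 5, the address bit [35] test, and the "less than 2" stall rule come from the text.

```python
IO_ADDRESS_BIT = 35  # from the text: IO writes have address bit [35] set


def is_io_write(addr: int) -> bool:
    """True if a write to this physical address counts as an IO write."""
    return bool((addr >> IO_ADDRESS_BIT) & 1)


class IoWriteCredits:
    """Behavioral model of the BIU's IO write buffer counter (illustrative)."""

    def __init__(self, entries: int = 5):
        self.capacity = entries
        self.count = entries  # free L2 IO write buffer entries

    def must_stall(self, loads_stores_in_flight: int) -> bool:
        # The pipeline stalls when count minus in-flight load/stores is < 2.
        return self.count - loads_stores_in_flight < 2

    def allocate(self, addr: int) -> None:
        # Only IO writes consume a credit.
        if not is_io_write(addr):
            return
        if self.count == 0:
            raise RuntimeError("IO write issued with no credits left")
        self.count -= 1

    def wb_ack(self) -> None:
        # Models the L2 asserting wbAck_pr at the coherency point.
        if self.count == self.capacity:
            raise RuntimeError("wbAck_pr seen with no IO write outstanding")
        self.count += 1
```

With the default of 5 entries, the model stalls once only one credit remains, or earlier if other load/stores are already in flight.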


6.6.7 Cache Instructions 


The CPU implements the MIPS CACHE Instruction. The L1-D “hit writeback” cache instruction has been 
changed to instead perform “hit writeback and invalidate.” This prevents the L2 from seeing an eviction from the 
cache instruction and believing it is the probe return. (Thus, we can enforce the rule that after eviction, a line is 
always invalid.) 


6.6.8 Prefetch Instruction 


The CPU implements the MIPS PREF Instruction. 

ICE9 used the original core, which implements load and store hints identically, and the writeback invalidate 
hint. Prefetches issued when the cache pipeline was busy were silently dropped. 

TWC9 prefetches are not dropped when the cache pipeline is busy, however they are still dropped on a TLB 
miss; they never take exceptions. TWC9 also adds L2 prefetches, see 6.24.4 on page 319. 

TWC9 retains the rule that there may be only one miss at once. However, there may be as many as 4 misses 
and L2 prefetches outstanding. In addition, a second miss-under-miss will be automatically converted into an L2 
prefetch. This allows software to get most of the latency benefit of two misses outstanding even if prefetches have 
not been inserted into the code. 

For L2 prefetches, TWC9 issues a PREF command on cpu_cac_reqCmd_pr. Data is never returned. When
the prefetch completes, the L2 asserts cac_cpu_rbAck_pr, with cac_cpu_rbRId_pr indicating which prefetch has
completed.

Note prefetches are not supported to DMSEG when in Debug mode, the behavior is unpredictable. It’s assumed 
there won’t be any prefetches in the debug handler. 


6.6.9 Sync Instruction 


The SYNC instruction requires all loads and stores that occurred before the SYNC to be completed before any 
loads or stores following the SYNC. In our multiprocessor system, this requires all loads to be completed and have 
results in the register file, that all cacheable stores have invalidated other CPUs caches, and that all non-cacheable 
I/O stores have reached the point at which they are ordered with respect to all other CPUs. 

Load/Sync ordering is ensured by stalling any SYNC until all loads have reached the register file. The original
CPU has code for this, but it should be verified.

Cached Store/Sync ordering is ensured by the L2 Cache asserting cac_cpu_syncBusy_pr until all stores have
completed, including invalidating the caches of other CPUs.

IO Store/Sync ordering is also ensured by stalling the SYNC until cac_cpu_syncBusy_pr deasserts. syncBusy must
remain asserted until the IO store has reached the IO write coherence point.


6.6.10 Load Linked and Store Conditional 


The Load Linked (LL, sometimes also called Load Locked) and Store Conditional (SC) instructions are used 
to implement critical sections. A LL instruction loads a memory location, remembers the address loaded and sets 
the lock bit. The following SC returns the lock bit to the register file, and if the lock bit was set, performs a store. 
Any store or DMA write (not just a SC completing) to the same address causes the lock bit to clear. 

To implement this scheme, we take a simple approach; we prevent any other processor from gaining access to 
the locked line for a certain holdoff time. 


• On executing a LL, we set the lock bit, and save the locked address. We start a timer, the Locked timer,
which counts up to 8 then resets. (Programmable from 8 to 1K in powers of two with R_CpuConfig_LLTime.)

• On executing a SC, we test the lock bit, and reset the locked timer.

• On executing an ERET, we clear the lock bit.
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Probe Probe Return Write 


[Timing diagram not reproduced. Signals shown: cac_cpu_prbReq_pr, cac_cpu_prbAddr_pr[35:3],
cpu_cac_invAck_pr, cpu_cac_invDirty_pr, cpu_cac_reqAValid_pr, cpu_cac_reqWrite_pr,
cpu_cac_reqAddr_pr[35:3], cpu_cac_reqWData_pr[63:0], cpu_cac_reqBE_pr[7:0], cpu_cac_reqBFirst_pr,
cpu_cac_reqBLast_pr.]
Figure 6.3: BIU Probe Timing 


• While the locked timer is counting, all probes will be held off, and the CPU is free to (hopefully) complete the
lock sequence. Note 8 cycles is enough to complete all Linux locks, and other locks we know about. Should
the lock complete, or the SC never execute, all is fine, otherwise:

• If a probe occurs outside the locked timer interval, and the probe address matches the lock address, the lock
bit is cleared.

• To prevent code that does LL inside a tight loop from livelocking out other CPUs’ probes forever, after the
locked timer has been used for N cycles, the lock timer will not work for another N cycles. A SC is still likely
to succeed during this time; however it is not guaranteed to succeed as it otherwise would.


Note at all times lock semantics are preserved; there is no case where write data could interfere with the critical 
section. 

Should software have large lock sequences over 8 instructions, there may be performance problems. To mitigate 
this, we make the interval programmable, and have an SCB event to track clearing of the lock due to probes. 
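
The holdoff scheme can be summarized with a small behavioral model. It is a sketch under several assumptions that the text does not settle: the class and method names are invented for illustration, addresses are compared exactly rather than at cache-line granularity, SC is assumed to clear the lock bit, and the "N cycles used, N cycles disabled" rule is modeled as a cooldown equal to the number of cycles the timer actually ran.

```python
class LockMonitor:
    """Illustrative model of the LL/SC lock bit and Locked timer."""

    def __init__(self, ll_time: int = 8):
        # The text allows 8 to 1K in powers of two (R_CpuConfig_LLTime).
        if ll_time < 8 or ll_time > 1024 or ll_time & (ll_time - 1):
            raise ValueError("ll_time must be a power of two from 8 to 1024")
        self.ll_time = ll_time
        self.lock_bit = False
        self.lock_addr = None
        self.timer = 0     # holdoff cycles remaining
        self.used = 0      # holdoff cycles consumed in the current run
        self.cooldown = 0  # cycles during which the timer will not start

    def ll(self, addr: int) -> None:
        self.lock_bit = True
        self.lock_addr = addr
        if self.cooldown == 0:
            self.timer = self.ll_time

    def tick(self) -> None:
        """Advance one processor clock."""
        if self.timer > 0:
            self.timer -= 1
            self.used += 1
            if self.timer == 0:
                self.cooldown, self.used = self.used, 0
        elif self.cooldown > 0:
            self.cooldown -= 1

    def probe(self, addr: int) -> str:
        if self.timer > 0:
            return "held"  # probes are held off while the timer counts
        if self.lock_bit and addr == self.lock_addr:
            self.lock_bit = False
        return "accepted"

    def sc(self) -> bool:
        ok = self.lock_bit
        self.lock_bit = False  # assumption: SC consumes the lock
        if self.timer > 0:
            self.cooldown, self.used = self.used, 0
        self.timer = 0
        return ok

    def eret(self) -> None:
        self.lock_bit = False
```

In this model, an SC issued within the holdoff window always succeeds, while an SC issued after the window succeeds only if no matching probe arrived in the meantime, which mirrors the "likely but not guaranteed" wording above.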


6.7 Interventions 


The CPU has an intervention bus to maintain coherency between the cores. The bus runs at processor clock 
frequency, and consists of an address, command, and acknowledgment back to the L2. 

The intervention bus is presented with an address from the L2 cache. First, if any load/stores are in the pipeline, 
the pipeline is stalled. 

This intervention address is looked up in the L1 tag store array. A clean hit will invalidate the line in the L1
D-Cache. A dirty hit will stall the load/store pipeline, and grab the D-Cache for four cycles. The data is read from
the D-Cache in 0-3 order and placed into the CPU’s eviction buffer. This requires the eviction buffer be free; if not,
the extraction stalls until space becomes available.

The intervention address is also compared against the CPU load/store buffer; this ensures data is returned for
hits on stores waiting for the L1 cache. A match will return dirty hit, and the L2 is responsible for retrieving the
data from the stream of write data.

The intervention address does not need to be compared against read requests. A match against an unissued
load can be ignored, as when it finally issues in the L2, the data will have been returned. A probe will not be issued
against an issued load, as this is guarded by the L2 line-collision CAM.


6.7.1 Intervention Deadlock Avoidance 


The intervention scheme requires that the load/store and eviction buffers make forward progress; however the
buffers may contain write transactions that have not yet reached the L2 cache and thus are before the coherency
point. To prevent this resource loop from resulting in a deadlock, the L2 must ensure that CPU reads and writes
can always be drained. When the L2 is accepting transactions (that is, when it is asserting ARdy), it will accept
and process L1 writebacks and all other writes in order and without queuing. If necessary in handling probes,
the L2 interface will enqueue cache read operations for processing after the completion of writeback or probe
operations. The space required for the “pending read queue” is relatively small, as the processor is limited to just
two outstanding READ operations at a time. (See 6.15.3.)


6.7.2 Example Intervention Cases


Sequence of events | CPU response

1. Not in D-Cache; 2. Intervention | The CPU acknowledges the intervention as a miss.

1. Clean in D-Cache; 2. Intervention | The CPU acknowledges the intervention as clean and invalidates the
D-Cache.

1. Dirty in D-Cache; 2. Intervention | The CPU acknowledges the intervention as dirty, reads the data from the
cache and places into the write buffer. The write is made to the L2.

1. Miss in progress, not issued by L2; 2. Intervention | The CPU acknowledges the intervention as a miss. This
is correct, as the CPU miss is ordered after the intervention.

1. Miss in progress, issued by L2, data not to CPU yet; 2. Intervention | Illegal. As the L2 has not returned the
data, the L2 is required to stall issuing the intervention until it does so.

1. Miss in progress, issued by L2, data sent to CPU; 2. Intervention | The CPU stalls the intervention on read
data buffer hit until the miss updates the L2, and then the intervention becomes a L1 hit case.

1. In write buffer; 2. Intervention | The CPU acknowledges the intervention as a dirty hit. The write will
propagate to the L2 as writes normally do.

1. In D-Cache; 2. Load or store in M or W-stage; 3. Intervention | The CPU stalls the intervention until the load
or store completes; additional loads or stores will stall if to the same D-Cache index. (The physical address is
not known in time, and the index is identical between the VA & PA.)
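
For verification or modeling work, the cases above can be captured as a lookup table. The sketch below is one possible encoding; the enum names and the response strings ("miss", "clean", "dirty", "stall") are labels chosen here for illustration, not signal values from the design.

```python
from enum import Enum, auto


class L1Situation(Enum):
    """Situations from the example intervention cases (names are illustrative)."""
    NOT_IN_DCACHE = auto()
    CLEAN_IN_DCACHE = auto()
    DIRTY_IN_DCACHE = auto()
    MISS_NOT_YET_ISSUED_BY_L2 = auto()
    MISS_ISSUED_DATA_NOT_RETURNED = auto()
    MISS_ISSUED_DATA_SENT_TO_CPU = auto()
    IN_WRITE_BUFFER = auto()
    LOAD_STORE_IN_M_OR_W_STAGE = auto()


# How the CPU answers the intervention in each legal situation.
_RESPONSE = {
    L1Situation.NOT_IN_DCACHE: "miss",
    L1Situation.CLEAN_IN_DCACHE: "clean",
    L1Situation.DIRTY_IN_DCACHE: "dirty",
    L1Situation.MISS_NOT_YET_ISSUED_BY_L2: "miss",
    L1Situation.MISS_ISSUED_DATA_SENT_TO_CPU: "stall",
    L1Situation.IN_WRITE_BUFFER: "dirty",
    L1Situation.LOAD_STORE_IN_M_OR_W_STAGE: "stall",
}


def intervention_response(situation: L1Situation) -> str:
    """Return the CPU's response, or raise for the case the L2 must prevent."""
    if situation is L1Situation.MISS_ISSUED_DATA_NOT_RETURNED:
        raise ValueError(
            "illegal: the L2 must hold the intervention until read data is returned"
        )
    return _RESPONSE[situation]
```

A testbench could use the raised error as an assertion that the L2 never issues a probe against a read it has issued but not yet answered.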


6.8 WAIT 


The CPU includes the WAIT instruction which places the CPU into power down mode until an enabled interrupt 
occurs (generally, this is a timer interrupt that was configured just before entering sleep.) During sleep, the BIU 
will awaken to accept and return interventions, identical to normal awakened mode. 
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6.9 Interrupts 


The CPU provides 6 level sensitive interrupts. (It also has a non-maskable interrupt or NMI that is unused.)
The first four of the six are activated by writes to the interrupt control register, the arrival of a slow interrupt,
or via a CSW INT transaction. (See Section 7.10.5 and Sections 7.18.6 through 7.18.9.) The top two levels are
reserved for causes internal to the processor.


5 Pinta] Cac TORT/S and RCpuCompare timer interrupts, | 
Cac ICR2/3, generally PCI-E. 
[0 L/S [ Software interrupt from same core 


6.10 EJTag 


The MIPS EJTAG port is connected to the SysChain JTAG bus so that the cores may be debugged. In addition 
a syschain register allows a debug trap on one CPU to cause debug traps to be taken on all CPUs. 


6.11 D Cache ECC 


The D-Cache has been changed to use byte ECC instead of byte parity. This was done without changing the 
pipeline or any instruction timings. 


6.12 Scheduling Hazards 


The CPU has the same instruction hazards as documented in the M5KF Software Users Manual, Section 12.2, 
with the following exception. 

The original 5KF required a CACHE instruction not be followed by a memory operation for 2 instructions. 
This restriction is removed, any instruction may follow a CACHE instruction, including a load/store to the same 
cache line. 


6.13 Dual Issue 


The CPU has the same dual issue rules as the 5kf. As its documentation is a bit obtuse, here is a restating of 
the rules. 
Dual issue if all of the following are true: 


• Not in delay slot.

• Single-issue bit is off.

• The instruction will not trap. (I.e., to dual issue a COP1 instruction, COP1 must be enabled.)

• One of the pair of instructions is: abs.*, add.*, c.*, ceil.*, cvt.*, div.*, floor.*, madd.*, mov.*, movcf,
msub.*, mul.*, neg.*, nmadd.*, nmsub.*, recip.*, round.*, rsqrt.*, sqrt.*, sub.*, trunc.*, MMDX with
instr[5:0]!=6’b0110x1, or COP2 instruction with instr[25]=1’b1. (Note this excludes ldxc1, luxc1, lwxc1, movz,
movn, prefx, sdxc1, suxc1, swxc1.)

• The other of the pair of instructions is: add, addi, addiu, addu, and, andi, ori, break, cache, dadd, daddi,
daddiu, daddu, ddiv, ddivu, div, divu, dmfc1, dmtc1, dmult, dmultu, dsll, dsll32, dsllv, dsra, dsra32, dsrl,
dsrl32, dsrlv, dsub, dsubu, lb, lbu, ld, ldc1, ldc2, ldl, ldr, ldxc1, lh, lhu, ll, lld, 1tl, lui, luxc1, lw, lwc1, lwc2,
lwl, lwr, lwu, lwxc1, mfc1, mfhi, mflo, movn, movz, mtc1, mthi, mtlo, mult, multu, or, pref, prefx, sb, sc, scd,
sd, sdc1, sdc2, sdl, sdr, sdrav, sdxc1, sh, sll(excluding_nop), sllv, slt, sltiu, sltu, sra, srav, srl, srlv, stti, sub,
subu, suxc1, sw, swc1, swc2, swl, swr, swxc1, sync, syscall, teq, tge, tgeu, tltu, tne, xnor, xor, xori, or COP2
instruction with instr[25:22]==4’b00x0. (Note this excludes cfc1, ctc1, deret, eret, jr, jalr, mfc0, movci, mtc0,
ssnop.)
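
The pairing rules lend themselves to a simple predicate, which can be handy when hand-scheduling inner loops. The sketch below is deliberately partial and makes assumptions: it covers only the dotted floating point arithmetic mnemonics from the first group (not movcf, MMDX, or COP2), it uses a small sample of the second group rather than the full list, and all names are illustrative.

```python
# Stems of the floating point arithmetic group, e.g. "madd" for "madd.d".
FP_ARITH_STEMS = {
    "abs", "add", "c", "ceil", "cvt", "div", "floor", "madd", "mov", "msub",
    "mul", "neg", "nmadd", "nmsub", "recip", "round", "rsqrt", "sqrt", "sub",
    "trunc",
}

# A partial sample of the "other" group (integer, load/store, moves).
OTHER_GROUP_SAMPLE = {
    "add", "addiu", "addu", "daddiu", "daddu", "dsll", "ld", "lw", "sd", "sw",
    "ldc1", "sdc1", "lwc1", "swc1", "ldxc1", "sdxc1", "mfc1", "mtc1",
    "pref", "prefx", "sync", "sub", "subu",
}


def is_fp_arith(mnemonic: str) -> bool:
    """True for dotted FP arithmetic mnemonics such as 'madd.d' or 'c.eq.s'."""
    stem, dot, _fmt = mnemonic.partition(".")
    return bool(dot) and stem in FP_ARITH_STEMS


def can_dual_issue(first: str, second: str, *, in_delay_slot: bool = False,
                   single_issue_bit: bool = False,
                   will_trap: bool = False) -> bool:
    """Apply the dual issue rules to a pair of mnemonics (partial model)."""
    if in_delay_slot or single_issue_bit or will_trap:
        return False
    return ((is_fp_arith(first) and second in OTHER_GROUP_SAMPLE)
            or (is_fp_arith(second) and first in OTHER_GROUP_SAMPLE))
```

For example, under this model `madd.d` pairs with `ldc1` in either order, while two floating point arithmetic instructions, or two integer instructions, do not pair.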


6.14 Floating Point Pipeline Enhancements 


The floating point pipe was modified to increase the issue rate of double-precision multiply and fused-multiply- 
add instructions. These include mul.d, madd.d, msub.d, nmadd.d, & nmsub.d. The effect is to change the m5kf 
latency (5 cycles) and “issue rate” (2 cycles) for these instructions to 4 cycles & 1 cycle, matching the latency and 
“issue rate” of the corresponding single-precision version of the same instructions. As a side effect of the change, 
recip.d and rsqrt.d also come out with improved performance. 

In the original m5kf, the resources devoted to the multiplier array were reduced (optimized) by implementing
half the hardware needed for a full double-precision multiplier and using the hardware on 2 consecutive cycles to
complete a double-precision multiply. (Single-precision multiply operations don’t need the additional cycle, so they
complete the multiply part of the operation in 1 cycle.) As a result, a multiply instruction following a
double-precision multiply had to wait a cycle before issuing, since the hardware would still be in use for the 2nd
cycle of the preceding multiply instruction. By building the full hardware needed for a double-precision multiplier,
the issue rate was doubled and the latency reduced, for something like a 10-15% improvement in delivered performance.
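
The effect of the latency and issue-rate change on a run of independent double-precision multiplies can be worked out with a little arithmetic. The sketch below is illustrative only: the struct and function names are invented here, the latency and repeat-rate figures are the ones quoted above for mul.d and madd.d, and it assumes the instructions have no data dependences on each other.

```c
#include <assert.h>

/* Pipeline timing for one class of FP instruction (names are hypothetical). */
struct fp_timing {
    unsigned latency;     /* cycles from issue to result */
    unsigned repeat_rate; /* cycles between back-to-back issues */
};

/* Figures quoted in the text for mul.d, madd.d and friends. */
const struct fp_timing M5KF_MUL_D = { 5, 2 };
const struct fp_timing ICE9_MUL_D = { 4, 1 };

/* Cycles to finish n independent, back-to-back instructions: the last one
 * issues at (n - 1) * repeat_rate and completes latency cycles later. */
unsigned cycles_for_stream(struct fp_timing t, unsigned n)
{
    assert(n > 0);
    return (n - 1) * t.repeat_rate + t.latency;
}
```

For long streams the total approaches n times the repeat rate, which is why halving the repeat rate matters more than the one-cycle latency saving. Real code mixes these instructions with others, so the delivered gain is far smaller than this best case.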

The approach we’ve taken in implementing the ICE9 changes is to collapse 28 Booth partial-products, plus 2
injected constants, into the sum-and-carry redundant-form representation of the multiply result in a single cycle.
This requires 4 levels of CSA, one more than in 1 cycle of the m5kf multiplier. The additional CSA inserted into
the cycle adds to the critical path in the multiplier array, but there was sufficient margin to make the insertion
without impact to the chip clock frequency. The changes are illustrated in the following 2 figures. The first shows
the organization of the m5kf multiplier array. The 2nd shows the organization of the ICE9 multiplier array.


Figure 6.4: M5kf Multiplier


• Do pass-1 & pass-2 in parallel; zero-out “feedback” terms
• Add a Level-4 CSA to combine the pass-1 & pass-2 results


Figure 6.5: ICE9 Multiplier 


6.14.1 Floating Point Repeat Rate and Latency 
Bolded values indicate change from M5KF. 


Latency (cycles) | Repeat Rate (cycles)
ABS.*, NEG.*, ADD.*, SUB.*, MUL.*, MADD.*, MSUB.*, NMADD.*, NMSUB.* | 1
C.cond.* to MOVF.* and MOVT.* / MOVT, MOVN, BC1
CVT.D.S, CVT.[S,D].[W,L]
MOV.*, MOVF.*, MOVN.*, MOVT.*, MOVZ.*
LWC1, LDC1, LDXC1, LUXC1, LWXC1
MTC1, DMTC1, MFC1, DMFC1


6.15 The L2 Cache Segment and Pipelines 


Each processor in the ICE9 chip is directly connected to a 256KB L2 cache segment. All six cache segments 
are kept coherent via the Cache Switch interface (CSW) described in Chapter 7. Most of the L2 cache (CAC)
runs at the central CCLK rate; only the interface to the processor contains elements clocked on the processor clock
(PCLK).

The CAC talks to the processor through the SLC unit. The SLC is responsible for retiming processor requests 
from PCLK to CCLK and retiming responses in the opposite direction. It processes all write requests as they are 
issued by the processor, and may enqueue read requests if necessary. All read requests are processed in order: reads 
don’t pass reads. Similarly, all writes are processed in order. However, to correctly handle the case of a Dstream 
L1 miss that requires a victimization of an L1 or L2 block followed by an Istream L1 miss, we allow writes to pass 
reads. 

The CAC also connects to the CSW. Probes are handled in order of arrival, but may be enqueued for an 
arbitrary number of cycles. 


6.15.1 The Tag Lookup 


The L2 segment is optimized to handle DCache misses in the absolute minimum number of cycles. Figure 6.6 is a
sketch of the pipeline from tag lookup to CSW command generation. From the delivery of the dcache miss address
at the edge of the MIPS core (a stage that we’ll label P0) to the command out to the CSW the path takes two
pipeline stages of 2nS each, plus a possible realignment penalty of 2nS (to align the PCLK request with the CCLK
domain) plus 4nS for the tag lookup, and a 4nS stage for driving the command to the CSW.


Figure 6.6: L2 Tag Lookup Pipeline

The last stage of logic in the lookup pipeline determines whether the PS needs to do a memory read, what kind 
of read command the PS should issue, and where it should go. Algorithm 6.1 describes the policy for choosing which
command to issue and which way to victimize. Note that we don’t wait to find out if the victim block is really dirty 
(requires a writeback), but instead assume that all blocks in the EXCLUSIVE, MODIFIED, or UPDATED state 
require a writeback. While we’re launching the CSW request we’ll start an L1 cache probe operation to acquire 
the dirty data (if any). If the displaced block is dirty, we’ll drive the data onto the CSW when it is ready. If we 
find that the block was not dirty in the L1 cache AND it was clean in the L2, we’ll send a WBCANCEL command 
to the appropriate coherence widget. 

Figure 6.7 shows the pipeline and general organization of the L2 Tag and State arrays. All components in this 
section run off the central clock. Note the four way mux at the top of the pipeline. Addresses enter from either the 
processor BIU, or the CSW fill and probe path. The address path from the BIU is required to support flush and 
writeback operations and for I-stream fetches. 

The Tag arrays are ECC protected. Each array contains 2K words of 26 bits each. The actual tag is 18 bits 
wide (address bits 34 through 17). The state information requires 3 bits. For the 20 data bits, we'll require 6 bits 
of SECDED ECC. The two banks are independently corrected to allow for independent updates. If the two tags 
are merged, the total storage requirement would be 2K words by 47 bits. Corrected words are not written back to 
the array. In the event of an ECC error, the L2 controller will signal an ECC error interrupt to the processor and 
the processor will initiate a flush of the L2 cache. Double bit errors will signal a machine check. 
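
The array dimensions quoted above can be cross-checked with a short calculation. The sketch below is not taken from the ICE9 design files; the function names are invented for illustration, and it assumes a textbook SECDED (extended Hamming) code, a two-way organization (the Way0/Way1 banks), and the 64 byte L2 block size mentioned later in this chapter.

```c
#include <assert.h>

/* Check bits for a SECDED (extended Hamming) code over data_bits bits:
 * the smallest r with 2^r >= data_bits + r + 1 gives single-error
 * correction, and one extra overall parity bit adds double-error detection. */
unsigned secded_check_bits(unsigned data_bits)
{
    assert(data_bits > 0 && data_bits < 1024);
    unsigned r = 1;
    while ((1u << r) < data_bits + r + 1)
        r++;
    return r + 1;
}

/* Number of sets (tag words per way) for a given cache geometry. */
unsigned cache_sets(unsigned cache_bytes, unsigned ways, unsigned block_bytes)
{
    assert(ways > 0 && block_bytes > 0);
    return cache_bytes / (ways * block_bytes);
}
```

With 20 data bits the code needs 6 check bits, giving the 26 bit word; merging two tags gives 40 data bits and 7 check bits, or 47 bits. A 256KB, two-way cache with 64 byte blocks has 2K sets, matching the 2K words per array.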

Algorithm 6.1 L2 Lookup Pipeline - CSW Command Generation
if (miss address is I/O space) {
    issue RDIO or WTIO as appropriate, to the “correct” bus stop.
    // see Section 6.18
} else if ((Way0Miss AND Way1Miss) OR
           (DFETCH AND ((Way0Hit AND (Way0State == SHARE)) OR
                        (Way1Hit AND (Way1State == SHARE))))) {
    csw address = miss address
    select victim as per Table 6.1 OR by the “DFETCH to SHARE” rule below.
    cmd way = victim way
    if address<6> csw destination = COHO
    else csw destination = COHE
    if (access is IStream) {
        if (victim state is SH or INV) csw command = RDSH // istream read with no writeback
        else csw command = RDSV // read with a possible victim writeback
    }
    else {
        if (victim state is SH or INV) csw command = RDEX // dstream read with no writeback
        else csw command = RDV // dstream read with possible victim writeback
    }
    bid for the appropriate CSW chain.
}


DFETCH to SHARE victimization rule:

if (DFETCH AND (Way0Hit AND (Way0State == SHARE))) victim = Way0
else if (DFETCH AND (Way1Hit AND (Way1State == SHARE))) victim = Way1
else find victim in Table 6.1.


Figure 6.7: L2 Tag and State Arrays - The Address Pipeline (All in CCLK domain)


Table 6.1: Victimization Rules


A block in the L2 is in one of five states:


INVALID: No data is stored in the associated block. All tag comparisons against this block will fail to match.


EXCLUSIVE: This block was filled in response to a DCache miss. The data in the block is identical to the copy
of the data in main memory. The L1 cache may have a copy of the data that is newer still.


MODIFIED: This block was filled in response to a DCache miss. The data in the block is newer than the copy
in main memory. The L1 cache may have a copy of the data that is newer still.


UPDATED: This block was filled in response to a DCache miss. Since the block was filled, the L1 cache has
written data through to this block. The L1 cache may have a copy of the data that is newer still.


SHARED: This block was filled in response to an ICache miss. It is identical to the copy of data in main memory.


Note that the LRU array is a bit vector. Bit X in the vector is set if the last access to set X in the tag array hit on
way zero. The L2 control unit uses this hint to choose the victim block when replacement is required. Replacement
ordering rules choose the victim block on a priority basis as shown in Table 6.1. LRU is used for the replacement
choice for all cases where both blocks are in a state other than INVALID.
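
The command choice in Algorithm 6.1 and the LRU hint can be sketched in C as follows. This is a simplified illustration rather than the controller logic itself: the type and function names are invented, the I/O-space branch and the DFETCH to SHARE rule are left out, and the preference for an INVALID way is an assumption standing in for the priorities of Table 6.1. The istream no-writeback command is called RDS here, as in the transaction descriptions; Algorithm 6.1 spells it RDSH.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum l2_state { L2_INVALID, L2_SHARED, L2_EXCLUSIVE, L2_MODIFIED, L2_UPDATED };
enum csw_cmd  { CSW_RDS, CSW_RDSV, CSW_RDEX, CSW_RDV };
enum coh_unit { COHE, COHO };

/* INVALID and SHARED victims hold nothing newer than memory. Every other
 * state is treated as needing a writeback without waiting to learn whether
 * the block is really dirty. */
bool victim_needs_writeback(enum l2_state victim)
{
    assert(victim <= L2_UPDATED);
    return victim != L2_INVALID && victim != L2_SHARED;
}

/* Read command for a cacheable miss, following Algorithm 6.1. */
enum csw_cmd csw_read_command(bool is_istream, enum l2_state victim)
{
    bool wb = victim_needs_writeback(victim);
    if (is_istream)
        return wb ? CSW_RDSV : CSW_RDS;
    return wb ? CSW_RDV : CSW_RDEX;
}

/* Address bit 6 selects which coherence unit receives the command. */
enum coh_unit csw_destination(uint64_t addr)
{
    return ((addr >> 6) & 1) ? COHO : COHE;
}

/* Victim way. ASSUMPTION: an INVALID way is taken first; Table 6.1 may
 * rank the remaining cases differently. lru_bit is set when the last
 * access to the set hit way 0, so way 1 is then the older block. */
unsigned pick_victim_way(enum l2_state way0, enum l2_state way1, bool lru_bit)
{
    if (way0 == L2_INVALID) return 0;
    if (way1 == L2_INVALID) return 1;
    return lru_bit ? 1u : 0u;
}
```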


6.15.2 The L2 Miss Data Pipeline 


Figure 6.6 and what we’ve discussed so far gets us to the command port of the CSW. The memory request will 
then wind its way to the memory controller and either cause a memory fetch or get forwarded to a processor that 
owns a copy of the block. When the data returns it will pass through the L2 update and L1 fill pipeline shown in 
Figure 6.8. There isn’t a whole lot to do in this path. We need to grab the data from the CSW, check and correct 
for any single bit errors, and then forward the data into the BIU port on the processor. 

The LfBuf in Figure 6.8 holds the fetched 64 byte block. The first 32 bytes are forwarded to the SLC unit and 
retimed to be sent into the processor. All 64 bytes are held until they are written into the L2 data array. 

Figure 6.8 omits a whole lot of detail. The L2 data array does not show details of the mux control, the L1
to L2 update path, or the address multiplexing for the L2 data arrays. None of these is all that important to the 
speed of L2 miss handling. 


6.15.3 L1 Updates Writebacks and Misses 


So far, we’ve described the path of L2 miss transactions. In all likelihood, at least two out of three accesses to 
the L2 cache will hit. Further, the L1 will occasionally displace dirty blocks into the L2. (Note that the processor
will never write an L1 data block to the L2 unless it had first read the block into the L1. This means that L1 writes
to the L2 will always hit in the L2, since the L1 is a subset of the L2.)

On an L1 read miss, the L1 may need to displace a block from the 32KB 4-way L1 DCache. Further, the read 
miss may require that we displace a block from the L2 as well. This means that the original L1 read miss (a single 
32 byte read transaction) may cause a 32 byte writeback (the L1 victimization), two L2 to L1 probe operations (to 
find out if either of the 32 byte halves of the displaced L2 block are cached in the L1) and between zero and two 
32 byte writeback operations (L1 copies of the displaced L2 block.) Confusing? Let’s try a few scenarios. 
Figure 6.8: The L2 Update and L1 Fill Pipeline


Time | Event
0 | Processor issues Dstream Read of address X at BIU port
2 | SLC retimes BIU request, sends address to TAG and DAT arrays
6 | Tag array looks up address X. Data array begins data lookup


Table 6.2: Simple L1 Read Miss - L2 Hit


Time | Event
0 | Processor issues Dstream Read of address X at BIU port
2 | SLC retimes BIU request, sends address to TAG and DAT arrays.
  | Sends data to DAT array.
6 | Tag array looks up address X
10 | Tag is a HIT on way W
  | Data array writes first and second data words into way W.
  | Data array writes third and fourth data words into way W.


Table 6.3: Simple L1 Writeback (All L1 writes hit in L2)


Table 6.2 shows the trajectory of an L1 Dstream miss that hits in the L2. Istream misses are processed identically. 
Note that because of alignment issues between the PCLK and CCLK domain, the actual time line may be shifted 
2 nS later (that is, SLC retiming may happen at time = 4nS) for half of all accesses. 


Table 6.6 shows the flow for an L1 read miss that requires eviction of an L1 cache block and an L2 cache block.
Note that in this case the L1 block could map to the L2 block. (This may be impossible, given that we’re using 
a different hash function in the L1 and L2 caches, but I’m not ready to bet on that yet.) The DAT unit ensures 
that any writes arriving from the processor will be checked against the L2 victim address. Writes to the L2 victim 
block will be routed to the WriteBack buffer (and thence to the CSW when the victim data is finally evicted). 


The time between issuing a probe request into the processor’s BIU and the arrival of the response can’t be 
determined a priori, so the table shows the first of two probes completing at time P1. The writeback of the L1
victim block may occur at any time between the arrival of the read-miss request and the end of time, but the 
overall operation will not be complete until both probe requests have completed AND any blocks that the L1 probe 
identified as dirty have been loaded into the WriteBack buffer and sent out to the CSW. 


6.15.4 CSW Probe Operations 


From time to time the coherence engines on the CSW will forward probe requests to the PS. Each request is first 
processed by the L2 controller to check for collisions against operations that are currently in flight. Commands are 
processed in order, but not necessarily immediately. The L2 controller queues up to 26 operations in the incoming
command queue. Probes are only processed when there are no L2 operations in flight — this is to prevent the huge 
tree of possible interactions between probes and L1/L2 references. 

The L2 controller then sends each probe request to the L2 tag array. In this case, the input to the tag array 
address mux is preempted. (This is why we capture the last DC miss address — we’ll launch the DC tag query 
when the probe is complete.) If the L2 tag compare indicates a MISS, the controller will send a PROBENOHIT 
as appropriate. If the L2 tag compare hits, then we’ll send a L1 probe request to the core. On completion of the 
core intervention, the controller will update the L2 data block with L1 writeback data and send the L2 data out to 
the CSW as necessary. (This latter operation is identical to a victim eviction with an L1 merge and uses the same 
buffers and machinery.) 


Probe operations are described in detail in Section 6.22. 
Time | Event
0 | Processor issues Dstream Read of address X at BIU port
2 | SLC retimes BIU request, sends address to TAG and DAT arrays
6 | Tag array looks up address X. Data array begins data lookup
10 | Tag is a miss on both ways. Way W is selected as victim.
  | Data array muxes data from way W (all 8 words) into the Writeback Buffer.
14 | Drive RDEX (Dstream) or RDS (Istream) onto CSW as appropriate.
T | CSW returns first 16 bytes (DAT[0], DAT[1]) of data to Data array Fill Buffer
T+4 | CSW returns DAT[2], DAT[3] to Fill Buffer.
  | SLC retimes DAT[0], DAT[1] to processor BIU.
  | DAT[0], DAT[1] written to data array. (This may be delayed if the L2 data array is busy.)
  | Update TAG array with current MOD STATE.
T+8 | SLC retimes DAT[2], DAT[3] to processor BIU.
  | DAT[2], DAT[3] written to data array.
  | DAT[4], DAT[5] written to data array.
  | DAT[6], DAT[7] written to data array.


Table 6.4: L1 Read Miss, L2 Read Miss, Victim block is in INVALID or SHARE state


Time | Event
0 | Processor issues Dstream Read of address X at BIU port
2 | SLC retimes BIU request, sends address to TAG and DAT arrays
6 | Tag array looks up address X. Data array begins data lookup
10 | Tag is a miss on both ways. Way W is selected as victim. Victim block address is V.
  | Data array muxes data from way W (all 8 words) into the Writeback Buffer.
14 | Drive RDV (Dstream) or RDSV (Istream) onto CSW as appropriate.
  | Send probe for block V to processor BIU
P1 | Probe completes in processor: invalidate the block, returns DIRTY if block must be written back.
  | Writeback operations from BIU to L2 data array begin after probe response.
  | SLC retimes writeback data, inserts data into WriteBack buffer. (Overwrites L2 data.)
  | Send probe for block V+32 to processor BIU
P2 | Probe completes in processor.
  | If neither block is DIRTY, send WBCANCEL to CSW.
P2+4 | Dump WriteBack buffer to CSW to complete RDV or RDSV writeback portion.
T | CSW returns first 16 bytes (DAT[0], DAT[1]) of data to Data array Fill Buffer
T+4 | CSW returns DAT[2], DAT[3] to Fill Buffer.
  | SLC retimes DAT[0], DAT[1] to processor BIU.
  | DAT[0], DAT[1] written to data array. (This may be delayed if the L2 data array is busy.)
  | Update TAG array with current MOD STATE.
T+8 | SLC retimes DAT[2], DAT[3] to processor BIU.
  | DAT[2], DAT[3] written to data array.
T+12 | DAT[4], DAT[5] written to data array.
  | DAT[6], DAT[7] written to data array.


Table 6.5: L1 Read Miss, L2 Read Miss, Victim block is EXCL, DIRTY, or UPDATED
Time | Event
0 | Processor issues Dstream Read of address X at BIU port.
  | L1 Victim address is L.
  | Tag array looks up address X. Data array begins data lookup.
  | SLC may send write operations for L at any time.
10 | Tag is a miss on both ways. Way W is selected as victim. Victim block address is V.
  | Data array muxes data from way W (all 8 words) into the Writeback Buffer.
14 | Drive RDV (Dstream) or RDSV (Istream) onto CSW as appropriate.
  | Send probe for block V to processor BIU.
  | Writes from SLC to address V are all routed to the WriteBack buffer.
  | Writes from SLC to address L are all routed to the L2 data array as a normal L1 write. (See Table 6.3.)
P1 | Probe completes in processor: invalidate the block, returns DIRTY if block must be written back.
  | Writeback operations from BIU to L2 data array begin after probe response.
  | SLC retimes writeback data, inserts data into WriteBack buffer. (Overwrites L2 data.)
  | Send probe for block V+32 to processor BIU
P2 | Probe completes in processor.
  | If neither block is DIRTY, send WBCANCEL to CSW.
P2+4 | Dump WriteBack buffer to CSW to complete RDV or RDSV writeback portion.
T | CSW returns first 16 bytes (DAT[0], DAT[1]) of data to Data array Fill Buffer
T+4 | CSW returns DAT[2], DAT[3] to Fill Buffer.
  | SLC retimes DAT[0], DAT[1] to processor BIU.
  | DAT[0], DAT[1] written to data array. (This may be delayed if the L2 data array is busy.)
  | Update TAG array with current MOD STATE.
T+8 | SLC retimes DAT[2], DAT[3] to processor BIU.
  | DAT[2], DAT[3] written to data array.
T+12 | DAT[4], DAT[5] written to data array.
  | DAT[6], DAT[7] written to data array.


Table 6.6: L1 Read Miss, L2 Read Miss with L1 and L2 evictions
6.15.5 Putting It All Together


We’re pretty tight for space in the processor segment. In particular, we’re limited as to how much room we
have for queues and attendant state aside from the 256KB worth of data in the L2 arrays. Figure 6.9 shows the
major components of the L2 portion of the processor segment and the total bytes of RAM, buffer, and register 
storage for each. Earlier sections have described the significant features of the tag and data arrays. The controller 
is responsible for all command parsing from the CSW and the MIPS BIU, as well as mux control and data steering 
in the tag and data arrays. 

The controller segment also initiates and responds to I/O space accesses (Section 6.18) and interrupts (Section 
6.9). 


6.15.6 The SLC (slick) and Processor Access Stalls 


The SLC is responsible for retiming requests and responses between the PCLK (processor clock) and CCLK 
(central clock) domains. It also handles all processor stall operations. 

While cache fills and victimizations are in progress, we occasionally need to prevent the processor from issuing 
new requests to the L2 data or tag arrays. There are two levels of stall operation. The first prevents all processor 
requests and is used in the early stage of a fill or probe operation to allow the CTL unencumbered access to the 
tag and data arrays. The second level allows write operations to propagate through, but enqueues up to two read 
operations in the SLC’s pending read queue. This is used in the later stage of fill and probe operations to allow 
invalidate writebacks to wend their way into the DAT array’s writeback buffer. 

The SLC ARdy state machine implements the “first level” stall and monitors stall requests from the DAT and
CTL units. Note that cac_cpu_ARdy_pr and cac_cpu_WDRdy_pr are wired together.
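
The two stall levels can be pictured as a small decision function. The sketch below is a behavioral illustration only, not the ARdy state machine: all names are invented, and holding the processor once the two-entry pending read queue is full is an assumption, since the text does not say what happens in that case.

```c
#include <assert.h>
#include <stdbool.h>

enum stall_level { STALL_NONE, STALL_ALL, STALL_READS };
enum slc_action  { SLC_FORWARD, SLC_ENQUEUE, SLC_HOLD };

#define SLC_PENDING_READ_SLOTS 2u /* "up to two read operations" */

struct slc {
    enum stall_level level;
    unsigned pending_reads; /* entries in the pending read queue */
};

/* Decide what happens to a new processor request under the current stall. */
enum slc_action slc_request(struct slc *s, bool is_write)
{
    assert(s->pending_reads <= SLC_PENDING_READ_SLOTS);
    switch (s->level) {
    case STALL_NONE:
        return SLC_FORWARD;
    case STALL_ALL:            /* first level: nothing gets through */
        return SLC_HOLD;
    case STALL_READS:          /* second level: writes pass, reads queue */
        if (is_write)
            return SLC_FORWARD;
        if (s->pending_reads < SLC_PENDING_READ_SLOTS) {
            s->pending_reads++;
            return SLC_ENQUEUE;
        }
        return SLC_HOLD;       /* assumption: a full queue stalls the processor */
    }
    return SLC_HOLD;
}
```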


6.16 Initial Program Load and Processor Start-up 


The processor segment implements the address request half of the initial program load process described in 
Section 12.8. 


6.17 Memory and IO Ordering Rules and Behavior 


Here are the simple rules for ordering behavior from the point of view of the processor and the programmer: 


1. To ensure that any memory reference A becomes apparent to other processors or an IO device before some 
other memory reference B, the programmer must insert a SYNC instruction between A and B. The sequence 
READ Mem[X]; WRITE Mem[Y] may be executed in inverse order if X and Y are not in the same 32
byte L1 block. 


2. IO WRITE references will complete in order. The sequence READ IoSpace[X]; WRITE IoSpace[Y] 
may reorder to WRITE IoSpace[Y]; READ IoSpace[X] but WRITE IoSpace[Y]; READ IoSpace[X]
will never reorder. That is, READ operations to IO space will be deferred until all IO and Memory space 
writes have completed and become apparent to the rest of the ICE9. 


3. IO WRITE and IO READ operations to CacLoc registers (the ICR registers, the CAC ECC Control registers, 
the SPCL register window, and the Interrupt Delivery registers) may re-order with respect to each other 
and with respect to IO WRITE operations to other parts of the address space. This means that SYNC 
instructions should be used to guard ordering for all such operations to the local control registers. Memory 
write operations, however ARE ordered with respect to IO WRITE operations to any of these registers. 


4. The ICE9 MIPS processor implements “hits under misses.” This means that reads may re-order relative to 
each other in the absence of a SYNC or other ordering event. In particular, no ordering of READs is implied
by the code in Figure 6.10 even if a[] and b[] are written by a process that inserts a SYNC between the update
of the two. Figure 6.11 shows that the read-order can be enforced by making the second read operation 
depend on the result of the first. (A SYNC would work too.) 


The CAC unit processes IO write operations in order. The CAC also ensures that IO writes won’t re-order
relative to IO reads. Some may interpret the MIPS ordering rules as requiring a sync between IO writes and
subsequent IO reads and vice versa. However, it is clear that many Linux IO drivers take liberties with the
ordering rule and work better if we can guarantee that IO reads and writes don’t pass each other. Therefore, the
CAC unit will enqueue all IO reads from the processor and will not pass them on to the CSW until all previously
issued IO writes have been completed. An IO write completes when the target device (the device owning the
register to which the write is directed) has issued the companion RDIO operation to get the WTIO data. (See
Section 6.21.6.)


Figure 6.9: Processor Segment L2 Major Units


int j, k;

Process WRITER
b[i] = new_B_value;
SYNC();
a[i] = new_A_value;

Process READER
j = a[i];
k = b[i]; // j may see new_A_value while k sees old_B_value


Figure 6.10: Unordered Reads


int j, k;

Process WRITER
b[i] = new_B_value;
SYNC();
a[i] = new_A_value;

Process READER
j = a[i];
if(j > 3) k = b[i]; // if j sees new_A_value then k must see new_B_value


Figure 6.11: Read Order Enforced by Dependency (assumes no re-ordering of operations by the compiler.)

The MIPS core may emit up to 5 IO writes at a time. The CAC handles only one IO write at a time, so there 
is a queue in the CMX (command multiplexer) unit that ensures IO writes are completed in order. IO writes may 
pass L1/L2 writeback operations, but this will not affect the “observed” order of memory updates vs. IO writes, 
as the cache coherence mechanisms are such that the newly written memory data will be observed by any devices 
that could observe the newly written IO data. 

It should be noted that the CAC enforces IO write ordering and IO write-vs-read ordering as noted above. 
However, IO writes to the SPCL delivery addresses or to the INT delivery register may pass other IO writes in 
flight. To ensure that SPCL and INT IO operations do not pass earlier IO transactions, applications should use a 
SYNC instruction as a barrier before the SPCL or INT op where necessary. 
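
The IO read deferral described in this section amounts to gating reads on a count of outstanding IO writes. The following is a minimal behavioral sketch under that reading; the structure and function names are hypothetical, and the limit of five outstanding writes is the figure quoted above for the MIPS core.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CORE_IO_WRITES 5u /* the core may emit up to 5 IO writes at a time */

struct io_order {
    unsigned writes_in_flight; /* IO writes issued but not yet completed */
};

/* The processor has issued an IO write. */
void io_write_issued(struct io_order *q)
{
    assert(q->writes_in_flight < MAX_CORE_IO_WRITES);
    q->writes_in_flight++;
}

/* The target device has issued the companion RDIO and taken the WTIO data. */
void io_write_completed(struct io_order *q)
{
    assert(q->writes_in_flight > 0);
    q->writes_in_flight--;
}

/* An enqueued IO read may go to the CSW only when every previously
 * issued IO write has completed. */
bool io_read_may_issue(const struct io_order *q)
{
    return q->writes_in_flight == 0;
}
```

Writes to the SPCL and INT delivery addresses are outside this model, since they may pass other IO writes in flight.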


6.18 I/O Accesses and Address Decoding 


The CAC unit processes IO read operations in order, and won’t reorder them relative to other read operations. 
IO reads may be reordered relative to L1 to L2 writeback operations as they are processed by the CAC, but the 
apparent order to all other devices will not violate ICE9 ordering rules. See Section 6.17. 


6.18.1 CAC Local IO Registers 


There are only a few registers local to the CAC. All are directly accessible only by the local processor. The 
addresses and register layouts are described in 7.18 on page 444. 


6.18.2 CAC Remotely Accessible IO Registers 


Currently there are no remotely accessible IO registers other than those provided on the SCB. 


6.19 Interrupts, Again 


We've talked about interrupts in a number of places. This is the final resting place of all interrupt controversy. 


6.19.1 CPU Interrupt lines 


Each CPU has 8 interrupts visible to software in the R_CpuCause_IP register. They are defined as follows:


Interrupt | Description
7 | CPU internal performance counters.
6 | CPU timer interrupts. Internal to each CPU core.
5 | Polled, errors and slow devices, or externally vectored by interrupt cause register. (See 7.18.6.)
4 | Vectored by interrupt cause register. Kernel assigns for DMA engine.
3 | Vectored by interrupt cause register. Kernel assigns for PCI-Express.
2 | Vectored by interrupt cause register. Kernel assigns for inter-processor interrupts.
1 | Software interrupt. Internal to each CPU core. Asserted and cleared by writing R_CpuCause_IP[1].
0 | Software interrupt. Internal to each CPU core. Asserted and cleared by writing R_CpuCause_IP[0].


6.19.2 The Interrupt Cause Registers 


Each PS has a bank of interrupt cause registers, R_CacLocIntCr[7:0]. Each ICR is 64 bits wide and corresponds
to one of the first four SI_Int level sensitive interrupts: ICR0 and ICR1 to IRQ2, ICR2 and ICR3 to IRQ3, etc. The
low 8 bits of the ICR contain the “reason” reported for the corresponding interrupt. Bit 8 of the ICR indicates an
“overflow” condition (described below). Bit 9 indicates that the corresponding interrupt is asserted. The remaining
bits are read as 0.
When the interrupt handler wishes to dismiss an interrupt, it must write a 1 to bit 9 of the related ICR. This
will clear the interrupt cause register and deassert the related interrupt. 

Finally, we have the problem of two interrupts arriving to write the same ICR before the first one has been 
handled and dismissed. In this case, the second interrupt request will set the OVERFLOW bit in the target ICR. 
No other bits in the ICR are affected by the second request. This means that software must poll all possible origins 
of requests to a given ICR whenever the overflow bit is set. 

The Interrupt Cause Registers are arranged as 64 bit I/O registers in each processor’s private address space. 
(That is, no processor can directly access another processor’s ICRs. For an explanation of indirect access, see 
Section 6.19.4.) 
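
The ICR layout and the overflow rule can be summarized in a few lines of C. This is a software model for illustration, not driver code: the macro and function names are invented, and the model assumes an ICR reads as zero once it has been dismissed, as the text above implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ICR_REASON_MASK UINT64_C(0xFF)      /* bits 7:0, the "reason" */
#define ICR_OVERFLOW    (UINT64_C(1) << 8)  /* bit 8 */
#define ICR_ASSERTED    (UINT64_C(1) << 9)  /* bit 9 */

struct icr_view {
    uint8_t reason;
    bool overflow;
    bool asserted;
};

/* Split a value read from an ICR into its fields. */
struct icr_view icr_decode(uint64_t icr)
{
    assert((icr >> 10) == 0); /* the remaining bits are read as 0 */
    struct icr_view v = {
        (uint8_t)(icr & ICR_REASON_MASK),
        (icr & ICR_OVERFLOW) != 0,
        (icr & ICR_ASSERTED) != 0
    };
    return v;
}

/* Model of what an incoming interrupt request does to an ICR. */
uint64_t icr_on_request(uint64_t icr, uint8_t reason)
{
    if (icr & ICR_ASSERTED)
        return icr | ICR_OVERFLOW;  /* second request: only OVERFLOW changes */
    return ICR_ASSERTED | reason;
}

/* Value the handler writes to dismiss the interrupt (a 1 in bit 9). */
uint64_t icr_dismiss_value(void)
{
    return ICR_ASSERTED;
}
```

A handler that finds the overflow bit set cannot trust the reason field alone and has to poll every possible origin for that ICR.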


6.19.3 The CSW INT Transaction and Writing the Interrupt Cause Registers 


The CSW INT command (see Section 7.10.5) appropriates the address field of the address/command “bus” 
to carry the interrupt cause and a choice of which interrupt to assert and which ICR to write. Bits 10:8 of the 
incoming “address” select the ICR from the set of 8 ICRs. Bits 10:9, by implication, select which interrupt will be 
asserted. Bits 7:0 are written to the appropriate ICR. A processor may deliver an interrupt to itself. 
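
Decoding the “address” carried by a CSW INT command is a matter of extracting three bit fields. In the sketch below the struct and function names are made up for illustration, and the IRQ numbering extends the “ICR0 and ICR1 to IRQ2, ICR2 and ICR3 to IRQ3, etc.” pattern from Section 6.19.2 to all eight ICRs, which is an assumption about what “etc.” covers.

```c
#include <assert.h>
#include <stdint.h>

struct int_target {
    unsigned icr_index; /* which of the 8 ICRs is written */
    unsigned irq;       /* which interrupt is asserted */
    uint8_t  reason;    /* value written to the ICR's low 8 bits */
};

/* Decode the address field of a CSW INT command. */
struct int_target decode_int_address(uint64_t addr)
{
    struct int_target t;
    t.icr_index = (unsigned)((addr >> 8) & 0x7);      /* bits 10:8 */
    t.irq       = 2 + (unsigned)((addr >> 9) & 0x3);  /* bits 10:9 */
    t.reason    = (uint8_t)(addr & 0xFF);             /* bits 7:0 */
    /* Each pair of ICRs shares one interrupt line. */
    assert(t.irq == 2 + t.icr_index / 2);
    return t;
}
```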


6.19.4 Interprocessor Interrupts 


Any processor can send an interrupt to any other processor via the interrupt delivery register. Writes to
R_CacLocIdr will cause a CSW INT to the appropriate destination node. The IDR is described in Section 7.18.7.
To deliver an interrupt to processor X, a processor Y will write X’s bus stop number, the index into X’s set of 
ICRs, and a reason code. The PS interface to the CSW will convert this I/O write to a CSW INT transaction. By 
convention, interrupt input 0 (IRQ2) is used for inter-processor interrupts. 

Note that this mechanism allows any processor to spoof interrupts from any device. That may come in handy 
some day. 


6.19.5 Machine Check Interrupts 


This section is obsolete — we have no “machine check” interrupt. 


6.19.6 “Slow” Interrupts 


Some ICE9 on-chip components need to originate interrupts, but don’t have a direct or convenient path to the 
CSW (where they could originate an INTR command). To accommodate this the OCLA Lac, PMI, SCB, FL, DMA,
FSW, UART and two COH units each have an interrupt wire they can tug on to indicate a need for service or the
occurrence of an error condition.

These interrupt signals are routed through the CSW to each of the six L2 Cac interfaces. Each Cac may select
which of the interrupt sources may cause an interrupt to be signaled with the Slow Interrupt Select register. If an
interrupt is asserted and it is also selected (enabled) by the R_CacLocSlIntSel register, processor interrupt input 3
(IRQ5) is asserted and remains asserted until the interrupt condition is cleared. (See Section 7.18.8.) The assertion
state of each of the incoming interrupts may also be read from the R_CacLocSlIntSel register.

In addition to the slow interrupts from other devices in the ICE9, the R_CacLocSlIntSel register contains two
bits indicating the detection of a correctable or uncorrectable ECC error. In the event of an uncorrectable error,
the CAC will assert the slow error interrupt to the processor. Correctable ECC errors will be signalled as INT[3].
Both error conditions may be cleared by writing a 1 to the appropriate bit in R_CacLocSlIntSel.


6.19.7 Delivering Interrupts to Other Processors 


Each ICE9 processor can deliver an interrupt to the ICR of any other processor via the outbound interrupt
delivery register R_CacLocIntDel. See Section 7.18.7. Writes to this register become INT requests on the CSW.


6.20 Error Correction, Detection, Control, and Testing


All data passing over the CSW is protected by ECC, as is all data and tag information in the L2 caches. 


Uncorrectable errors are signalled by asserting the processor’s non-maskable interrupt. Correctable errors are 
signalled by a slow interrupt. (See 6.19.6, and 7.18.8.) 
Each CAC, in its own local IO CSR space, provides five registers for control and monitoring of ECC generation
and detection. They follow the scheme described in 12.4. 


6.21 Processor/L2 Transactions — NittyGritty Details 


This section outlines the flow of data and sequence of control actions for all of the possible transactions that 
could take place between the processor and the L2 or I/O system. 

Most of the cases enumerated here require a lookup in the L2 tag array. In the case of D-stream accesses, we 
accelerate the tag lookup by launching a speculative lookup using the address sent to the L1 D-cache. When the 
BIU state machine sends the actual miss request to the cache segment, we check the BIU address against the last 
speculative miss address. If they match, we make use of the earlier tag lookup result. Otherwise, we send the BIU 
address through the tag lookup pipeline. In the descriptions that follow, we lump all this tag-lookup machinery 
into the notion of “performing an L2 tag lookup” without rehashing the details each time. 
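
The speculative lookup check can be sketched as a comparison against the last speculative address. This is an illustration under stated assumptions rather than the TAG unit's logic: the names are invented, and comparing at 64 byte L2 block granularity is an assumption, since the text only says that the addresses must match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define L2_BLOCK_BYTES 64u

/* Result of the most recent speculative tag lookup (hypothetical layout). */
struct spec_lookup {
    bool     valid; /* a speculative lookup has been performed */
    uint64_t addr;  /* address sent to the L1 D-cache */
    bool     hit;   /* outcome of that lookup */
    unsigned way;   /* matching way, if hit */
};

/* True when the earlier result can stand in for a fresh lookup of the BIU
 * miss address. Otherwise the BIU address goes through the tag pipeline. */
bool spec_lookup_usable(const struct spec_lookup *s, uint64_t biu_addr)
{
    assert(s != NULL);
    uint64_t mask = ~(uint64_t)(L2_BLOCK_BYTES - 1);
    return s->valid && ((s->addr & mask) == (biu_addr & mask));
}
```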

All 32 byte fills from the L2 to the L1 are delivered in “best word first” order. Responses to probes are delivered 
from the processor to the L2 in “word 0 first” order. In all cases, probe addresses sent from the L2 to the processor 
will set address bits [4:3] equal to 0. 
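
As a small illustration of the probe address rule, the sketch below assumes 64 bit (8 byte) words, so that address bits [4:3] select one of the four words in a 32 byte block. The function names are hypothetical.

```python
WORD_SELECT_MASK = 0b11 << 3  # address bits [4:3]


def probe_address(address: int) -> int:
    """Clear bits [4:3] so the probe address points at word 0 of its 32 byte block."""
    return address & ~WORD_SELECT_MASK


def word_index(address: int) -> int:
    """Which 64 bit word of the 32 byte block the address selects (assumed layout)."""
    return (address >> 3) & 0b11
```

Bits [2:0] (the byte within a word) are left untouched by this sketch, since the text constrains only bits [4:3].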


6.21.1 Processor L1 Cache Read Miss 
6.21.1.1 I-Stream Read L1 Miss, L2 Hit 


I-stream read L1 misses are recognized by the assertion of cpu_cac_reqAValid_pr, cpu_cac_reqBurst_pr, 
and cpu_cac_reqInstr_pr. (If burst is not asserted, then this I-stream access is bypassing the L1 cache. See 
Section 6.21.3.) 

The TAG unit will signal an L2 hit to the CTL after performing an L2 tag lookup on the BIU request. (The 
SLC will multiplex the BIU address onto the TAG index and address comparison inputs.) The CTL then directs 
the DAT unit to perform a 32 byte read of the appropriate block. The DAT sends the 32 byte block to the SLC. 
The CTL tells the SLC to sequence a burst read back to the processor’s cac_cpu_rtnRData_pr bus. 


6.21.1.2 I-Stream Read L1 Miss, L2 Miss 


I-stream read L1 misses are recognized by the assertion of cpu_cac_reqAValid_pr, cpu_cac_reqBurst_pr, 
and cpu_cac_reqInstr_pr. (If burst is not asserted, then this I-stream access is bypassing the L1 cache. See 
Section 6.21.3.) 

The TAG unit will signal an L2 miss to the CTL after performing an L2 tag lookup on the BIU request. (The 
SLC will multiplex the BIU address onto the TAG index and address comparison inputs.) The TAG unit also 
reports the choice of the victim block and its state to CTL. 

If the state of the victim block is SHARED or INVALID, the CTL will send an RDS command to the CSW. 
When the data returns, the CTL will route the data through the DAT unit to the SLC. The SLC will retime the 
first 32 bytes of the return data onto the cac_cpu_rtnRData_pr bus. 

If the state of the victim block is EXCLUSIVE, MODIFIED, or UPDATED, the CTL will direct the TAG unit 
to send a probe request to the processor via the cac_cpu_prb* inputs. (See Section 6.21.8.) At the same time, the 
CTL will send an RDSV command (read shared with victim) to the CSW. When data returns, it will be routed to 
the BIU in the same manner as for an RDS transaction. 


6.21.1.3 D-Stream Read L1 Miss, L2 Hit 


D-stream read L1 misses are recognized by the assertion of cpu_cac_reqAValid_pr and cpu_cac_reqBurst_pr, 
and the deassertion of cpu_cac_reqInstr_pr. (If burst is not asserted, then this D-stream access is bypassing the 
L1 cache. See Section 6.21.3.) 

The TAG unit will signal an L2 hit to the CTL after performing an L2 tag lookup on the BIU request. (The 
CTL may make use of the tag lookup performed using the “fast path” described above.) The CTL then directs the 
DAT unit to perform a 32 byte read of the appropriate block. The DAT sends the 32 byte block to the SLC. The 
CTL tells the SLC to sequence a burst read back to the processor’s cac_cpu_rtnRData_pr bus. 


6.21.1.4 D-Stream Read L1 Miss, L2 Miss 


D-stream read L1 misses are recognized by the assertion of cpu_cac_reqAValid_pr and cpu_cac_reqBurst_pr, 
and the deassertion of cpu_cac_reqInstr_pr. (If burst is not asserted, then this D-stream access is bypassing the 
L1 cache. See Section 6.21.3.) 

The TAG unit will signal an L2 miss to the CTL after performing an L2 tag lookup on the BIU request. (The 
CTL may make use of the tag lookup performed using the “fast path” described above.) The TAG unit also reports 
the choice of the victim block and its state to CTL. 

If the state of the victim block is SHARED or INVALID, the CTL will send an RDEX command to the CSW. 
When the data returns, the CTL will route the data through the DAT unit to the SLC. The SLC will retime the 
first 32 bytes of the return data onto the cac_cpu_rtnRData_pr bus. 

If the state of the victim block is EXCLUSIVE, MODIFIED, or UPDATED, the CTL will direct the TAG unit 
to send a probe request to the processor via the cac_cpu_prb* inputs. (See Section 6.21.8.) At the same time, the 
CTL will send an RDV command (read exclusive with victim) to the CSW. When data returns, it will be routed to 
the BIU in the same manner as for an RDEX transaction. 
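
The command selection described in Sections 6.21.1.2 and 6.21.1.4 can be summarized in a short sketch. The function name and the tuple it returns are hypothetical; the state names and CSW commands are those used in the text.

```python
VICTIM_STATES = ("INVALID", "SHARED", "EXCLUSIVE", "MODIFIED", "UPDATED")


def read_miss_command(is_istream: bool, victim_state: str):
    """Return (csw_command, probe_l1_for_victim) for an L1 read miss that misses in the L2."""
    if victim_state not in VICTIM_STATES:
        raise ValueError(f"unknown victim state: {victim_state}")
    # EXCLUSIVE, MODIFIED, and UPDATED victims require a probe and a victim command.
    needs_victim = victim_state in ("EXCLUSIVE", "MODIFIED", "UPDATED")
    if is_istream:
        return ("RDSV" if needs_victim else "RDS", needs_victim)
    return ("RDV" if needs_victim else "RDEX", needs_victim)
```

For example, a D-stream miss whose victim is UPDATED yields `("RDV", True)`: the probe into the L1 and the RDV command are issued together.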


6.21.2 Processor L1 Cache Write Miss 


All L1 misses caused by a store instruction are converted into L1 read miss requests by the BIU. See Section 
6.21.1. 


6.21.3 Processor L1 Cache Bypass Read to Cacheable Memory 


Earlier versions of this specification indicated that the L2 segment would support 64 bit reads (that is, non-burst 
reads) to memory. This is no longer supported. Such reads to memory produce an unpredictable result. (Such 
reads can only be caused by certain accesses to non-cached memory space.) 


6.21.4 Processor L1 Cache Bypass Write to Cacheable Memory 


Earlier versions of this specification indicated that the L2 segment would support 64 bit writes to memory. This 
is no longer supported. Uncached writes to memory space produce unpredictable results. 


6.21.5 Processor I/O Read 


Processor read operations to non-cacheable addresses (addresses with the MSB of the physical address set) 
are passed on to the CSW or to the processor segment’s local registers. Such operations are recognized by the 
assertion of cpu_cac_reqAValid_pr, cpu_cac_reqAddr_pr[35], and the deassertion of cpu_cac_reqWrite_pr. 
If cpu_cac_reqBurst_pr is asserted, the operation will return 0 for all 32 bytes in the burst. 

In the case of local register read operations (the address falls in the range of this PS segment’s I/O range, or 
in the CPULOC I/O range — see Section 16.6.6) the CTL will select the appropriate register and steer its data to 
the SLC. The SLC will sequence the data onto the BIU data pins. Note that the CTL is responsible for address 
decoding and sequencing operations. The interrupt reason registers are in the CPULOC I/O range. 

I/O accesses that are outside the CPULOC I/O range must be sent to the appropriate device. The CTL selects 
the device and initiates the CSW RDIO transaction. When data returns, the DAT unit notifies the CTL and the 
CTL steers the incoming data from the DAT unit to the SLC where it is sequenced onto the BIU. For operations 
sent to the CSW, the byte enable vector cpu_cac_reqBE_pr[7:0] is passed along to the CSW. In general, this is 
irrelevant to I/O registers created by SiCortex, as we prohibit reads from causing side-effects. However, we don’t 
own all the I/O devices on the chip, so we must provide machinery that honors the size of I/O read requests. 


6.21.6 Processor I/O Write 


Processor write operations to non-cacheable addresses (addresses with the MSB of the physical address set) are 
passed on to the CSW or to the processor segment’s local registers. Such operations are recognized by the assertion 
of cpu_cac_reqAValid_pr, cpu_cac_reqAddr_pr[35], and cpu_cac_reqWrite_pr. If cpu_cac_reqBurst_pr is 
asserted, the operation is a block write to I/O space and is ignored. (See Section 6.21.7.) 

In the case of local register read operations (the address falls in the range of this PS segment’s I/O range, or 
in the CPULOC I/O range; see Section 16.6.6) the CTL will select the appropriate register and steer its data to 
the SLC. The SLC will sequence the data onto the BIU data pins. Note that the CTL is responsible for address 
decoding and sequencing operations. The interrupt reason registers are in the CPULOC I/O range. 

I/O accesses that are outside the CPULOC I/O range must be sent to the appropriate device. The CTL selects 
the device and initiates the CSW WTIO transaction, and loads the write data into the WtIoDat register in the 
DAT unit. When the targeted unit responds with the completing RDIO transaction, the CTL unit will direct the 
DAT unit to drive the contents of WtIoDat onto the CSW data bus. When the DAT unit receives a data grant, 
the CTL will complete the write operation. Since the processor may queue up to 5 I/O writes for processing, such 
requests enter an I/O write queue in the SLC and are processed in order. For operations sent to the CSW, the 
byte enable vector cpu_cac_reqBE_pr[7:0] is passed along to the CSW. 

Once the I/O write has completed, the CTL unit sends a write buffer acknowledgement back to the processor 
through the SLC via the cac_cpu_wbAck_pr signal. 
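
The in-order I/O write queue can be modelled in a few lines. This is a sketch under stated assumptions: the class and method names are hypothetical, and recording an acknowledgement in a list stands in for asserting cac_cpu_wbAck_pr.

```python
from collections import deque

IO_WRITE_QUEUE_DEPTH = 5  # the processor may queue up to 5 I/O writes


class IoWriteQueue:
    """Hypothetical model of the SLC I/O write queue; writes complete in order."""

    def __init__(self):
        self._pending = deque()
        self.acks = []  # addresses acknowledged back to the processor, in order

    def enqueue(self, address, data, byte_enables):
        if len(self._pending) >= IO_WRITE_QUEUE_DEPTH:
            raise OverflowError("more than 5 I/O writes outstanding")
        self._pending.append((address, data, byte_enables))

    def complete_oldest(self):
        """Complete the oldest write and record its write buffer acknowledgement."""
        address, data, byte_enables = self._pending.popleft()
        self.acks.append(address)
        return address, data, byte_enables
```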


6.21.7 Processor L1 Eviction 


A processor may evict an L1 block at any time. It signals a block writeback with the assertion of cpu_cac_reqAValid_pr, 
cpu_cac_reqWrite_pr, and cpu_cac_reqBurst_pr. Block writes to I/O space are ignored. 

When the L1 eviction address passes from the SLC to the CTL, the CTL will set up the DAT pipeline for an L1 
writeback. When the pipeline is ready, the CTL will tell the SLC to assert cac_cpu_LWDRdy_pr. The processor 
then sources data onto the cpu_cac_WData_pr bus which is retimed by the SLC unit to 128 bits. The DAT 
pipeline appends ECC bits and writes the 32 byte block into the L2 data array. The Tag unit changes the state of 
the block to “UPDATED.” 
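
Sections 6.21.1 through 6.21.7 distinguish transactions by a handful of request signals. The sketch below gathers those rules in one place. The function name and the returned category strings are hypothetical labels; the arguments stand for cpu_cac_reqAValid_pr, the request address, cpu_cac_reqWrite_pr, cpu_cac_reqBurst_pr, and cpu_cac_reqInstr_pr.

```python
IO_ADDRESS_BIT = 35  # cpu_cac_reqAddr_pr[35] marks non-cacheable (I/O) space


def classify_request(valid, address, write, burst, instr):
    """Classify one BIU request using the signal combinations named in the text."""
    if not valid:
        return "NONE"
    is_io = bool((address >> IO_ADDRESS_BIT) & 1)
    if is_io:
        if write:
            # Block writes to I/O space are ignored (Section 6.21.7).
            return "IO_BLOCK_WRITE_IGNORED" if burst else "IO_WRITE"
        # Burst reads to I/O space return 0 for all 32 bytes (Section 6.21.5).
        return "IO_READ_RETURNS_ZERO" if burst else "IO_READ"
    if write:
        # Non-burst writes to memory are no longer supported (Section 6.21.4).
        return "L1_EVICTION" if burst else "UNSUPPORTED_BYPASS_WRITE"
    if burst:
        return "ISTREAM_READ_MISS" if instr else "DSTREAM_READ_MISS"
    # Non-burst reads to memory are no longer supported (Section 6.21.3).
    return "UNSUPPORTED_BYPASS_READ"
```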


6.21.8 L2 Probe to Processor 


The L2 cache may initiate a probe request to the processor for two reasons. First, the L2 may need to displace 
a block due to an L2 miss caused by a processor request. (See Section 6.21.1.) Second, the L2 may launch a probe 
into the L1 to retrieve a potentially dirty block in response to a CSW PRBWIN command. In either case, the 
actions by the L2 segment and the processor are identical. Note that in both cases, the CTL will send in TWO 
probe requests, one for each of the two 32 byte blocks that map to the 64 byte block of interest. They will be sent in 
order and the second will not be sent until after the first probe has been acknowledged via the cpu_cac_invAck_pr 
signal. 

The probe is launched by the CTL. CTL sends a probe request to the SLC unit which drives the probe address 
onto cac_cpu_prbAddr_pr, and asserts cac_cpu_prbReq_pr. At some later time, the processor will respond with 
some combination of assertions of cpu_cac_invHit_pr, cpu_cac_invDirty_pr, and cpu_cac_invLock_pr along 
with the assertion of cpu_cac_invAck_pr. 


6.21.8.1 Probe Hits on Clean Block 


When the probe hits on a clean block, the processor will assert cpu_cac_invAck_pr while deasserting cpu_cac_invDirty_pr. 
In this case, the CTL will not anticipate a writeback for the block. The CTL will release the DAT data path to 
send probe or victim data out to the coherence widget or probe requester. If both probes for a block return clean 
and the block is clean in the L2, the CTL will initiate a WBCANCEL operation in the event of a writeback, or 
complete the PRBINV operation. 


6.21.8.2 Probe Hits on Dirty Block 


When the probe hits on a dirty block, the processor will assert cpu_cac_invAck_pr and cpu_cac_invDirty_pr. 
In this case, the CTL expects a writeback for the block. Before launching the probe, the CTL pre-arms the DAT 
unit to prepare it to accept writeback data. With the arrival of invDirty, the SLC alerts the DAT unit that 
writeback data is arriving. Once both probes have completed, the CTL unit will complete the writeback or probe 
operation by signalling the DAT unit to forward the updated data to the CSW. 


6.21.8.3 Probe Misses in L1 


When the probe misses in the L1 D-cache, the processor will assert cpu_cac_invAck_pr while deasserting 
cpu_cac_invDirty_pr. (This is indistinguishable from a hit on a clean block. The cpu_cac_invHit_pr signal may 
be used as a hint for performance counters, but is not to be used in making protocol decisions.) In this case, the 
CTL will not anticipate a writeback for the block. The CTL will release the DAT data path to send probe or victim 
data out to the coherence widget or probe requester. If both probes for a block return clean and the block is clean 
in the L2, the CTL will initiate a WBCANCEL operation in the event of a writeback, or complete the PRBINV 
operation. 
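
The decision made after the two L1 probes for a 64 byte block can be sketched as below. The names are hypothetical. Note that the hit indication is carried but deliberately never consulted, matching the rule that cpu_cac_invHit_pr is only a performance hint.

```python
from dataclasses import dataclass


@dataclass
class ProbeAck:
    """Signals sampled when cpu_cac_invAck_pr is asserted."""
    dirty: bool        # cpu_cac_invDirty_pr
    hit: bool = False  # cpu_cac_invHit_pr: hint only, not used for protocol decisions


def writeback_expected(acks):
    """The CTL expects writeback data if either 32 byte probe reported dirty."""
    return any(ack.dirty for ack in acks)


def may_cancel_writeback(acks, l2_block_clean):
    """Both probes clean and the block clean in the L2: WBCANCEL (or finish the PRBINV)."""
    return l2_block_clean and not writeback_expected(acks)
```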


6.22 L2 Responses to Probe Requests 


In the responses described below, data is returned — if required — to the appropriate requester via the DAT unit. 
The DAT unit is responsible for sequencing data responses to the CSW and waiting for data grant. 

Note that the cache segment stalls all new read accesses from the processor while probe handling occurs. This 
avoids ships-passing-in-the-night problems with, for instance, a PRBBWT arriving, finding a hit, and then finding 
that the target block has been evicted when the BWT data arrives. Further, when the cache segment responds to 
a PRBBWT with a BWTGO, it will wait until all outstanding L2 fills or IO reads have completed, since the CSW 
port can handle just one block of data coming to a processor in any given cycle. 

All probe handling begins with the dispatch of the operation at the top of the CTL probe state machine. When 
an enqueued probe request is found, the CTL unit causes the SLC to “pause” the processor BIU. Each probe flow 
will wait for the SLC to signal that the BIU is now in the paused state. (The BIU is paused when ARdy is deasserted 
and there are no L1 instigated operations currently in flight.) Figure 6.12 shows the probe operation dispatch. 


6.22.1 PRBINV 


The CTL unit will drive the incoming address to the Tag unit. At the same time, the CTL will direct the SLC 
to hold off further processor BIU requests (via the cac_cpu_reqARdy_pr signal) while the Tag array is occupied. 
The Tag unit will use ctl_xxx_Addr_c2a to generate a lookup and set the matching state (if any) to INVALID 
unless the current state is EXCL/DIRTY/UPDATED.[1] Incoming PRBINV commands that carry a TID that is 
owned by the receiving unit do not update the cache, but send an INVDONE to the originating COH unit. 

The CTL generates an INVDONE response to the CSW when a PRBINV has been handled (or when a PRBINV 
for a TID owned by this unit arrives). The CTL state machine flow for PRBINV is shown in Figure 6.13. 


6.22.2 PRBWIN 


The CTL unit will drive the incoming address to the Tag unit. At the same time, the CTL will direct the SLC 
to hold off further processor BIU requests (via the cac_cpu_reqARdy_pr signal) while the Tag array is occupied. 
The Tag unit will use ctl_xxx_Addr_c2a to generate a lookup and set the matching state (if any) to INVALID. 
The Tag unit will also report the result of the lookup to the CTL. 

If the lookup missed in the Tag, the CTL unit will initiate a PRBNOHIT response to the original requester. 

If the lookup hit in the Tag the CTL will initiate two probes into the L1 cache for the two L1 blocks contained 
in the identified L2 block. After the L1 probes complete (see Section 6.21.8.) the CTL will direct the DAT unit to 
send the L2 block to the requester. 

(Note that if the L2 block was in the SHARED or INV state, we still do the probes. PRBWIN shouldn’t arrive 
for blocks in the SHARED state, but as the L2 ignores the state in this case, we’ll complete the transition. However, 
the hardware is broken at this point.) 

The BRD, WIN, and SHR flows all share a common writeback flow shown in Figure 6.16. 


6.22.3 PRBBRD 


The CTL will drive the incoming address to the Tag unit. At the same time, it will launch two probe requests 
to the L1. 

After the probe requests complete, and any data has been written to the L2, the CTL directs the L2 to respond 
with data as for a PRBWIN request. In this case, however, the tag array state remains unchanged. The transaction 
is handled by the CTL PRB state machine as shown in Figure 6.15. 


[1] In this case, the PRBINV is “stale”: it arrived sometime in the past, was neglected until now, and is now applied 
to a block that was recently acquired. 

Figure 6.16: Common Writeback Flow 


6.22.4 PRBBWT 


The CTL will drive the incoming address to the Tag unit. At the same time, it will launch two probe requests 
to the L1. The CTL will direct the SLC and DAT units to ignore the return data (if any) from the L1. (The object 
here is to clear the valid bits for the relevant blocks in the L1 cache.) 

After the probe requests are launched (but we don’t wait for acknowledgement) the CTL originates a BWTGO 
command on the CSW to prompt the requester to send the block of data. 

When the data arrives, the DAT unit will write the data into the L2 data array. During this operation, the 
CTL will direct the SLC to hold off all BIU write data with the cac_cpu_reqWDRdy_pr signal. 

After the DAT unit signals to the CTL that the transfer has completed, the CTL will send a BWTDONE signal 
to the COH. (This allows the coherence engine to “complete” the write action and trigger any operations that were 
dependent on the block write.) 

Note that only the DMA engine and the PCIe controller can initiate BWT operations, so the PS need only 
provide enough book-keeping slots to keep track of 8 BWT operations at a time. They will always arrive with a 
TID of DMAWTx or PCIWTx where x can range from 0 to 3 inclusive. 
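
The count of eight book-keeping slots follows directly from the TID naming: two initiators times four TIDs each. A minimal sketch, with a hypothetical tracker class and an assumed (arbitrary) slot ordering:

```python
# TIDs named in the text: DMAWT0..DMAWT3 and PCIWT0..PCIWT3.
BWT_TIDS = [f"{source}WT{x}" for source in ("DMA", "PCI") for x in range(4)]


class BwtTracker:
    """Hypothetical book-keeping for outstanding block write (BWT) operations."""

    def __init__(self):
        self.in_flight = {}  # TID -> block address awaiting BWT data

    def start(self, tid, address):
        if tid not in BWT_TIDS:
            raise ValueError(f"not a BWT TID: {tid}")
        if tid in self.in_flight:
            raise RuntimeError(f"{tid} already has a BWT outstanding")
        self.in_flight[tid] = address

    def done(self, tid):
        """Retire the slot once BWTDONE has been sent for this TID."""
        return self.in_flight.pop(tid)
```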


6.22.5 PRBSHR 


A PRBSHR request will arrive when this processor has cached a block in any state and another processor also 
wishes to cache the block in SHARED state. Figure 6.18 shows the flow. 


6.23 L2 Responses to Other CSW Commands 


6.23.1 PRBNOHIT 


PRBNOHIT arrives in response to a forwarded RDEX, RDV, RDS, or RDSV operation. In this case, the CTL 
immediately drives the appropriate retry (RDEXR or RDSR) operation onto the CSW. 


6.23.2 RDIO 


The CTL unit will check the incoming CSW address against the known address ranges fielded by this node. CTL 
will drive the incoming address onto ctl_xxx_Addr_c2a and assert ctl_xxx_IORd_c2a. 

If the address is out of bounds (i.e. does not match any range), the CTL will direct the DAT unit to initiate a 
data response with a data field of all zeros. 

If the address is in bounds, the CTL will drive a unit select signal to the PS unit that owns the registers. The 
target unit will use the ctl_xxx_Addr_c2a signal to select the appropriate register and drive return data to the 
DAT unit. The DAT unit will pass the data on to the CSW. 
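
The RDIO decode amounts to a range check with a zero-data fallback. The sketch below is illustrative only: the range table, the callback, and the function name are hypothetical, and real address ranges are defined elsewhere in the specification.

```python
def handle_rdio(address, register_ranges, read_register):
    """Return the data for an incoming RDIO.

    register_ranges maps a unit name to a (base, limit) pair, limit exclusive.
    read_register(unit, address) models the selected unit driving return data.
    """
    for unit, (base, limit) in register_ranges.items():
        if base <= address < limit:
            # In bounds: the owning unit selects the register and returns its data.
            return read_register(unit, address)
    # Out of bounds: respond with a data field of all zeros.
    return 0
```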


6.23.3 WTIO 


The CTL unit will check the incoming CSW address against the known address ranges fielded by this node. CTL 
will drive the incoming address onto ctl_xxx_Addr_c2a and assert ctl_xxx_IOWt_c2a. 

As noted in the L2 Cache chapter, WTIO transactions are double-ended so as to allow the processor node to 
prevent two data blocks from arriving at a CSW bus stop at the same time. On receipt of a WTIO, the CTL unit 
will initiate an RDIO command to the requester using the same TID Ty as the incoming request. The address 
attached to the RDIO is “WTIOADDR”. At some later time, the data will return from the original requester. The 
TID will be matched up against Ty and the data routed to the appropriate destination. 

If the address is out of bounds (i.e. does not match any range), the CTL will direct the DAT unit to drop the 
data into the bit bucket. 

If the address is in bounds, the CTL will drive a unit select signal to the PS unit that owns the registers. The 
target unit will use the ctl_xxx_Addr_c2a signal to select the appropriate register and write the incoming data 
from the DAT unit (on dat_xxx_IOWtDat_c4a) into the target register. 
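
The double-ended WTIO exchange, seen from the receiving node, can be sketched as a two-step handler. The class, its fields, and the range table are hypothetical; the RDIO reply carrying the same TID and the “WTIOADDR” address come from the text.

```python
class WtioTarget:
    """Hypothetical model of a node receiving WTIO transactions."""

    def __init__(self, register_ranges):
        self.register_ranges = register_ranges  # unit -> (base, limit), limit exclusive
        self.registers = {}  # address -> last value written
        self.pending = {}    # TID -> target address awaiting write data
        self.sent = []       # commands driven back onto the CSW

    def on_wtio(self, tid, address):
        """Step 1: remember the target and ask the requester for the data."""
        self.pending[tid] = address
        self.sent.append(("RDIO", tid, "WTIOADDR"))

    def on_data(self, tid, data):
        """Step 2: match the returning data by TID, then write it or drop it."""
        address = self.pending.pop(tid)
        in_bounds = any(lo <= address < hi for lo, hi in self.register_ranges.values())
        if in_bounds:
            self.registers[address] = data
        return in_bounds  # False means the data went to the bit bucket
```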

WTIO operations may not be initiated by either the PCI Express controller or the DMA unit. All WTIO 
operations are initiated by processors. This is important, as the CAC tracks just one WTIO transaction for each of 
the processors. (No processor/L2 complex can have more than one write operation outstanding at a time. Other 
I/O writes from the processor are enqueued in the SLC until resources are available in the CAC to handle them.) 

Figure 6.17: CTL State Machine Flow for PRBBWT 

Figure 6.18: CTL State Machine Flow for PRBSHR 


6.23.4 INT 


See Section 6.19. 
On receipt of an INT command, the CTL unit will send a DONE response via the CSW to the originating node. 


6.23.5 Incoming Data Completing a Memory Read Operation 


When data arrives from the CSW in response to a RDEX, RDV, RDS, or RDSV operation, the CTL unit will 
check the DataOrigin. If DataOrigin is not a coherence widget, the CTL unit will send a PRBDONE request to 
the appropriate coherence widget to complete the transaction. 


6.24 Registers and Definitions 


For details on most of these registers, consult the MIPS 5kf Processor Core Family Software User’s Manual. 

For a whole host of reasons all registers in the L2 cache portion of the processor segment are defined in the 
CSW and Coherence chapter. See Section 7.17. 

The following CPU registers have been modified relative to the m5kf programmer’s manual. The changes are 
bolded in the register descriptions referenced below. 


Register/Field Name 


6.24.1 Package Attributes 
6.24.1.1 Package 


chip_cpu_spec 


Attributes 


-public_rdwr_accessors 


6.24.2 Definitions 
Defines 
CPU 


Constant | Mnemonic    | Definition
         |             | Low bit of tag hash field
         |             | How many bits in the tag hash index field
32’d17   | TAGHASH1_LO | Low bits of the other tag hash field (HASH0 XOR HASH1 -> TagIndex)
32’d19   | TAG_WIDTH   | How many bits in the tag itself?


May 14, 2014 317 Rev 51328 


SiCortex Confidential CHAPTER 6. PROCESSOR SEGMENTS 


6.24.3 Register List 


The following enumeration summarizes the CPU registers. Note that the constant includes both the register number 
and the select field; the CP0_REG macro may be used to split them up. 
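
Splitting a constant into its two parts might look like the sketch below. It assumes the 8 bit constant holds the register number in its upper five bits and the select field in its lower three bits, which is how the constants in the table that follows read (BadVAddr, register 8 select 0, is listed with octal digits 010 and select 0). The function names are hypothetical stand-ins, not the macro mentioned above.

```python
def cp0_reg_split(constant: int):
    """Split an 8 bit CP0 constant into (register_number, select). Assumed layout."""
    return constant >> 3, constant & 0b111


def cp0_constant(register_number: int, select: int) -> int:
    """Inverse of cp0_reg_split under the same assumed layout."""
    return (register_number << 3) | select
```

Under this assumption, Config1 (register 16, select 1) is `0o201` and splits back into `(16, 1)`.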


Enum 


CpuCp0 


Attributes 


-allowle 


Constant | Mnemonic | Definition
8’000_0  | Index    | Index into the TLB entry
8’001_0  | Random   | Randomly generated index into the TLB array
8’002_0  | EntryLo0 | Low-order portion of the TLB entry for even-numbered virtual pages
8’003_0  | EntryLo1 | Low-order portion of the TLB entry for odd-numbered virtual pages
8’004_0  | Context  | Pointer to page table entry in memory
8’005_0  | PageMask | Control for variable page size in TLB entries
8’006_0  | Wired    | Controls the number of fixed (wired) TLB entries
8’010_0  | BadVAddr | Reports the address for the most recent address-related exception
8’011_0  | Count    | Processor cycle count
8’012_0  | EntryHi  | High-order portion of the TLB entry
8’013_0  | Compare  |
8’014_0  | Status   |
8’015_0  | Cause    |
8’016_0  | EPC      | Program counter at last exception
8’017_0  | PRId     | Processor identification and revision
8’020_0  | Config   | Configuration register
8’020_1  | Config1  | Configuration register 1
8’022_0  | WatchLo  | Low-order watchpoint address
8’023_0  | WatchHi  | High-order watchpoint address
8’024_0  | XContext | Extended-addressing page table context
         |          | Reserved
         |          | Performance virtual program counter address
         |          | Performance virtual program counter address
         |          | Performance physical effective address
         |          | Performance physical effective address
8’027_0  | Debug    | Debug register
8’030_0  | DEPC     | Program counter at exception entering Debug Mode
8’031_0  | PerfCnt0 | Performance counter interface
8’031_1  | PerfCnt1 | Performance counter interface
8’031_2  | PerfCnt2 | Performance counter interface
8’031_3  | PerfCnt3 | Performance counter interface
8’032_0  | ECC      | Parity/ECC error control and status
8’033_0  | CacheErr | Cache parity error control and status
8’034_0  | TagLo    | Low-order portion of cache tag interface
8’034_1  | DataLo   | Low-order portion of cache data interface
8’035_0  | TagHi    | High-order portion of cache tag interface
8’035_1  | DataHi   | High-order portion of cache data interface
8’036_0  | ErrorEPC | Program counter at last error
8’037_0  | DSAVE    | Debug Exception Save Register


6.24.4 Prefetch Hint Encodings 


The following enumeration is used for the hint field of the pref instruction. 


Enum 


CpuPrefHint 


Constant | Mnemonic  | Product | Definition
5’d0     | LOAD      | TWC9A+  | Load Prefetch. Data is expected to be read and not modified. If the address translates and misses in L1, read the data exclusive into the L1.
5’d1     | STORE     | TWC9A+  | Store Prefetch. Data is expected to be written. Implemented same as LOAD.
5’d4     | LOADSTR   | TWC9A+  | Load Streamed. Data is expected to be read once and does not need to be cached. Implemented same as LOAD.
5’d5     | STORESTR  | TWC9A+  | Store Streamed. Data is expected to be written once and does not need to be cached. Implemented same as LOAD.
5’d6     | LOADRET   | TWC9A+  | Load Retained. Data is expected to be read many times, versus the LOADSTR stream. Implemented same as LOAD.
5’d7     | STORERET  | TWC9A+  | Store Retained. Data is expected to be written many times, versus the STORESTR stream. Implemented same as LOAD.
5’d25    | NUDGE     | TWC9A+  | Writeback Invalidate. Data is not to be used; if dirty, writeback to L2; if clean, invalidate.
5’d26    | LOADL2    | TWC9A+  | Load Prefetch to L2. Data is expected to be read, but prepare only L2 cache. If the address translates and misses in L1, read the data exclusive into the L2 cache, do not change L1 cache.
5’d27    | STOREL2   | TWC9A+  | Store Prefetch to L2. Data is expected to be written, but prepare only L2 cache. Implemented same as LOADL2.
5’d30    | PREPSTORE | TWC9A+  | Prepare for Store. Data will overwrite an entire cache line, the data can be filled with zeros. Not implemented, becomes NOP.
(else)   |           | TWC9A+  | Reserved. Unimplemented in the core and will be NOPed; note some unimplemented encodings remain architecturally defined.
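
The table collapses ten hint encodings onto four behaviours. A sketch of that mapping follows. The dictionary and function names are hypothetical. Encodings 1 and 4 are taken from the standard MIPS64 pref hint assignments, and treating STORE as "same as LOAD" is an assumption made by analogy with the other store hints.

```python
# Hint encoding -> mnemonic, as listed for the pref instruction.
PREF_HINTS = {
    0: "LOAD", 1: "STORE", 4: "LOADSTR", 5: "STORESTR",
    6: "LOADRET", 7: "STORERET", 25: "NUDGE",
    26: "LOADL2", 27: "STOREL2", 30: "PREPSTORE",
}

# Mnemonic -> behaviour actually implemented.
IMPLEMENTED_AS = {
    "LOAD": "LOAD", "STORE": "LOAD", "LOADSTR": "LOAD", "STORESTR": "LOAD",
    "LOADRET": "LOAD", "STORERET": "LOAD",
    "NUDGE": "NUDGE",
    "LOADL2": "LOADL2", "STOREL2": "LOADL2",
    "PREPSTORE": "NOP",  # not implemented, becomes NOP
}


def effective_prefetch(hint: int) -> str:
    """Return the behaviour the core applies for a 5 bit pref hint value."""
    mnemonic = PREF_HINTS.get(hint)
    # Reserved encodings are unimplemented and treated as NOP.
    return IMPLEMENTED_AS.get(mnemonic, "NOP")
```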


6.24.5 CPU Performance Counter Events 


The following events are trackable by the CPU performance counters in CP0 register 25. These are different 
event encodings from the SCB performance counters. 

ICE9A uses the same encodings as the M5KF core, which unfortunately has different events for each counter. 
CpuCntEvent0 is the encoding for ICE9A’s counter 0; CpuCntEvent1 is for ICE9A’s counter 1. 

ICE9B uses a different enumeration from ICE9A, CpuCntEvents, which sanitizes the scheme by applying the same 
enumeration to both counters and adds additional events. 


Enum 


CpuCntEvent0 


Attributes 


-descfunc 


Constant | Mnemonic | Product | Definition (For more details, see descriptions in CpuCntEvents.)
6’h05    | SCFAIL   | ICE9A   | Conditional stores that fail
6’h06    | BRANCH   | ICE9A   |
6’h07    | ITLBMISS | ICE9A   | ITLB misses
6’h08    | DTLBMISS | ICE9A   | DTLB misses.
         |          | ICE9A   | Reserved
         |          | ICE9A   | Reserved

Enum 


CpuCntEvent1 


Attributes 


-descfunc 


Constant | Mnemonic | Product | Definition (For more details, see descriptions in CpuCntEvents.)
6’h05    | FLOAT    | ICE9A   | Floating point instructions executed. Includes all COP1 instructions, including loads and stores.
         |          | ICE9A   | Reserved
         |          | ICE9A   | Reserved


Enum 


CpuCntEvents 


Attributes 


-descfunc 


Constant | Mnemonic | Product | Definition
6’h01    | INSFETCH | ICE9B+  | Instructions fetched. Incremented by the number of instructions (0,1,2) fetched by the instruction buffer.
6’h02    |          | ICE9B+  | Load/pref/sync/cache ops. Incremented by one each time a load, pref, sync, or cache instruction is executed.
6’h03    | STORE    | ICE9B+  | Stores. Incremented by one each time a store instruction completes M stage, regardless of whether it has completed storing to memory. Note that a store conditional is considered executed even if it fails to perform the store due to the LL bit being clear.
6’h04    | SC       | ICE9B+  | Conditional stores. Incremented by one each time a store conditional, passing or failing, completes M stage.
6’h05    | SCFAIL   | ICE9B+  | Conditional stores that fail. Incremented by one each time a store conditional fails.
6’h06    | BRANCH   | ICE9B+  | Branches executed. Incremented by one each time a conditional branch instruction executes.
6’h07    |          |         |
6’h08    |          |         |
6’h09    |          |         | I-Cache misses. Incremented by each miss in the I-Cache.
6’h0a    | INSSCHED | ICE9B+  | Instructions scheduled. Incremented by one each time an instruction is scheduled.
6’h0b    | MISPRED  | ICE9B+  | Branches mispredicted. Incremented by one each time a conditional branch is mispredicted.
6’h0c    | FLOAT    | ICE9B+  | Floating point instructions executed. Includes all COP1 and COP1X instructions, including floating point loads, floating point stores, and floating point conditional branches.
6’h0d    | COP2     | ICE9B+  | COP2 and COP2X instructions executed. Includes all COP2 and COP2X instructions, including COP2 loads, COP2 stores, and COP2 branches.
6’h0e    | INSDUAL  | ICE9B+  | Dual issued instructions. Incremented by *two* each time an instruction pair is dual issued.
6’h0f    | INSEXEC  | ICE9B+  | Instructions executed. Incremented by the number of instructions (0,1,2) which have completed their execution in the integer and floating point units. For this count, an instruction is completed if it has passed its M stage without being killed, or was a SYSCALL, BREAK, SDBBP, or trap. A load instruction is considered as executed if it completed the M stage, even though it may not have returned data. MDU and floating point instructions are also counted as completed when they finish M stage, though they may require additional cycles.
6’h10    | DCEVICT  | ICE9B+  | Data cache line evicted. Incremented by one each time a 32-byte line is evicted from the L1 data cache. This includes evictions caused by probes.
6’h11    | TLBTRAP  | ICE9B+  | TLB miss exception traps. Incremented by one on each TLB miss exception taken.
6’h12    | DCMISS   | ICE9B+  | Data cache misses. Incremented by one on each L1 Data cache miss.
6’h13    | MSTALL   | ICE9B+  | Scheduling conflict M-stage stalls. Incremented each cycle the M-stage pipeline is stalled due to scheduling conflicts.
6’h14    | L2REQ    | ICE9B+  | Cacheable L2 Cache requests. The count increments when the load completes, which may be many instructions after the load if there are no load data dependencies.
6’h15    | L2MISS   | ICE9B+  | Cacheable L2 Cache requests that miss in local L2. The count increments when the load completes, which may be many instructions after the load if there are no load data dependencies.
6’h16    | L2MISSALL | ICE9B+ | Cacheable L2 Cache requests that miss in all caches and fill from memory. The count increments when the load completes, which may be many instructions after the load if there are no load data dependencies.
6’h17    | FPARITH  | ICE9B+  | Floating point arithmetic instructions. Increments for each MADD/ NMADD/ MSUB/ NMSUB/ ADD/ SUB/ MUL/ DIV/ SQRT/ RECIP/ RSQRT.
6’h18    | FPMADD   | ICE9B+  | Floating point multiply-add instructions. Increments for each paired instruction; MADD/ NMADD/ MSUB/ NMSUB.
(else)   |          |         | Reserved


6.24.6 SCB Performance Core Events 


The following CPU counter events are trackable by SCB statistical event counting. This table is inserted twice, 
once for each counter, into CpuScbEvent under the mnemonics C0_ and C1_. For more details on each event, see 
the descriptions in CpuCntEvents. 
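
Several events in the table below can increment by 0, 1, or 2 per cycle and are therefore exposed as separate bit 0 and bit 1 counters. The arithmetic for recombining them is sketched here. The function names are hypothetical, and the per-cycle list exists only to show where the two counts come from.

```python
def split_bit_counters(per_cycle_values):
    """Count how often bit 0 and bit 1 of a 0..2 per-cycle value are set."""
    bit0 = sum(value & 1 for value in per_cycle_values)
    bit1 = sum((value >> 1) & 1 for value in per_cycle_values)
    return bit0, bit1


def combine_bit_counters(bit0_count, bit1_count):
    """Total events: multiply the bit 1 counter by 2 and add the bit 0 counter."""
    return 2 * bit1_count + bit0_count
```

For the rows that note scaling in driver software, this recombination has already been applied to the driver-level counter.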


Enum 


CpuScbCoreEvent 


Constant Mnemonic Product | Definition (For more details, see descriptions in 
ee re Een Gpucatbnente) nn PNY 
5’h02 INSFETCH_B1 | ICE9B-+ | Instructions fetched bit 1. Multiply by 2 and add bit 0 
Pee eee eel counter for total number of instructions. 

: I 


N a 
N 
5’h04 NSDUAL_B1 ICE9B+ | Dual issued instructions bit 1. Multiply by 2 and add 
bit 0 counter for total number of instructions. (Scaled in 
Pe la driver software, so read PO_INSDUAL instead.) 
N = 
N a 


5’h05 INSEXEC_B0 | ICE9B+ | Instructions executed bit 0.


5’h06 INSEXEC_B1 | ICE9B+ | Instructions executed bit 1. Multiply by 2 and add bit 0
counter for total number of instructions. (Scaled in driver
software, so read P0_INSEXEC instead.)


5’h10 FLOAT_B0 | ICE9B+ | Floating point instructions executed, bit 0. Note this includes
all COP1 instructions, including load/stores.

5’h11 FLOAT_B1 | ICE9B+ | Floating point instructions executed, bit 1. Multiply by
2 and add bit 0 counter for total number of instructions.
(Scaled in driver software, so read P1_FLOAT instead.)


5’h12 COP2_B0 | ICE9B+ | COP2 instructions executed, bit 0.
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L2MISSALL 


5’hib FPARITH 
FPMADD 
shidshi] 




6.24.7 SCB Performance Events 


The following events are trackable by SCB statistical event counting. 


Enum 


CpuScbEvent 


Attributes 


-descfunc 
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Cachable L2 Cache requests that miss in all caches and 


Floating point multiply-add instructions 


8’h00 CYCLES | Cpu cycles. Always counts.
8’h01 DCHIT | L1 D-Cache hits.
8’h02 DCMISS | L1 D-Cache misses.
8’h03 ICHIT | L1 I-Cache hits.
8’h04 ICMISS | L1 I-Cache misses.
8’h05 INSTNCOMPLETE | Instruction completed.
8’h06 ITLBHIT | Instruction TLB hits.
8’h07 ITLBMISS | Instruction TLB misses.
8’h08 DTLBHIT | Data TLB hits.
8’h09 DTLBMISS | Data TLB misses.
8’h0a JTLBHIT | Joint TLB hits.
8’h0b JTLBMISS | Joint TLB misses.


8’h0c SLEEP Sleep cycles. Cycles between WAIT instruction and inter- 
rupt or other wakeup. 


8’h0d-8’h0f | Reserved.


8’h10 STALLR | R-stage pipeline stall.
8’h11 STALLR_DM | R-stage pipeline stall due to dispatch manager.
8’h12 STALLR_MD | R-stage pipeline stall due to multiply/divide.
8’h13 STALLR_CP | R-stage pipeline stall due to COP condition code.


8’h14 STALLR_DAT | R-stage pipeline stall due to data dependency. Includes
data not ready, bypass not possible, and pending writeback
stalls.


8’h17 STALLE | E-stage pipeline stall.
8’h18 STALLE_DCPRB | E-Stage DCache pipeline stall due to probe.


8’h19 STALLE_DCNPRB E-Stage DCache pipeline stall due to non-probe. Sources 
include waiting for another fill, eviction buffer empty, 
read-after-write hazard prevention, etc. 
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8’hlb 


8’h26 
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M-stage pipeline stall due to MMU. 
M-stage pipeline stall due to coprocessor 
M-stage pipeline stall due to data. Includes coprocessor
data delivery conflicts, load/store data not ready, and
WAW hazard delays.


Peis | PROBE Probst 

[eng2 [PROBA [| Probes that 

Shag PROBE-DIRTY | [ Probes that hit dim E—___] 

PROBE_LOCK | Probes that clear the lock bit.

PROBE_WAIT | Cycles L2 is waiting for a probe to complete.
8’h38 LLHOLDOFF | Cycles the LL Timer is non-zero.

RIN [Read return eyeles 


=e 


8’h4l T'NL2_10 
8’h42 TNL2_HIT 


A A 


Bhd 


a esl 


A] A] A] 


NFR_PS5 

8’h58 NFR_IO 

DQr 

RDQIS 

RD? 




RDQS 
WR 
WROIS 
WRQB 
WROI 
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Read return came from local L2 cache 


Read return for I/O space. (Physical addr [35] set.) 


Read return did not come from local L2 cache.
Read return from local L2 exclusive.
Read return from local L2 shared.
Read return from local L2 dirty.
Read return from local L2 updated.
Read return from remote L2 1.
Read return from remote L2 2.
Read return from remote L2 3.
Read return from remote L2 4.
Read return from remote L2 5.
Read return for IO transaction.


Read queue entry 1 occupied 
Read queue shadow entry 1 occupied 


Read queue entry 3 occupied 


Read queue entry 2 occupied 


Write queue shadow entry 2 occupied 
Write queue entry 3 occupied 


Write queue entry 4 occupied 


Interrupt 0 cycles. Cycles cpu interrupt #0 asserted (not
occurrences), ignoring mask bit.


Interrupt 5 cycles. 




Write queue entry 2 occupied. 
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Sh76 INTO a 
8’h77 INT7 | Interrupt 7 cycles.


8’h78 INT Interrupt cycles. Cycles asserted (not occurrences) across 
all types, ignoring mask bits. 


8’hb0 IFETCHWT | Cycles of I-Stream Fetch Wait. Indicates the pipeline was
empty, and data was not delivered by ICache. Generally
indicates ICache miss delays or ITLB miss delays.


8’hb1 IFETCHWT8 | An IFETCHWT of >= 8 cycles.
8’hb3 IFETCHWT24 | An IFETCHWT of >= 24 cycles.
8’hb4 IFETCHWT32 | An IFETCHWT of >= 32 cycles.
8’hb5 IFETCHWT48 | An IFETCHWT of >= 48 cycles.
8’hb6 IFETCHWT64 | An IFETCHWT of >= 64 cycles.
8’hb7 IFETCHWT96 | An IFETCHWT of >= 96 cycles.


8’hb8 DATAWT | Cycles of Data Fetch Wait. Indicates an instruction was
stalled waiting for source registers. Generally indicates
DCache miss delays, DTLB miss delays, or other
data-dependent delays.


8’hb9 DATAWT8 | A DATAWT of >= 8 cycles.
8’hba DATAWT16 | A DATAWT of >= 16 cycles.
8’hbb DATAWT24 | A DATAWT of >= 24 cycles.
8’hbc DATAWT32 | A DATAWT of >= 32 cycles.
8’hbd DATAWT48 | A DATAWT of >= 48 cycles.
8’hbe DATAWT64 | A DATAWT of >= 64 cycles.
8’hbf DATAWT96 | A DATAWT of >= 96 cycles.


(Below events C0-DF correspond to the Cpu’s internal
performance counter 0 events, though with different
numbering. They only count in user, supervisor,
or kernel mode, as programmed in the
R_CpuPerfCount[0]/CP0 Reg25 register.

The CPU internal counters may increment by 2 or 3 for
some events. As the SCB can only increment by one,
these events are split into _B0 and _B1 events. Count both,
then present _B1*2+_B0 to the user; the result should be
similar to the CPU internal count for the same event.

In ICE9A, for these SCB events to increment, one of the
Cpu’s internal performance counters must be enabled.
This restriction is removed in ICE9B.)


8’hc0(-8’hdf) | C0_ | ICE9B+ | Core Perf 0 ENUM:CpuScbCoreEvent. See above note.
See the CpuScbCoreEvent enumeration; it is inserted here 
to avoid duplication in this table. 


P0_CYCLES | ICE9A | Perf 0 Cycles.
P0_INSFETCH_B0 | ICE9A | Perf 0 Instructions fetched bit 0.


8’hc2 P0_INSFETCH_B1 | ICE9A | Perf 0 Instructions fetched bit 1. Multiply by 2 and add
bit 0 counter for total number of instructions.


P0_INSSCHED | ICE9A | Perf 0 Instructions scheduled.


8’hc4 P0_INSDUAL_B1 | ICE9A | Perf 0 Dual issued instructions bit 1. Multiply by 2 and
add bit 0 counter for total number of instructions. (Scaled
in driver software, so read P0_INSDUAL instead.)


P0_INSEXEC_B0 | ICE9A | Perf 0 Instructions executed bit 0.


8’hc6 P0_INSEXEC_B1 | ICE9A | Perf 0 Instructions executed bit 1. Multiply by 2 and add
bit 0 counter for total number of instructions. (Scaled in
driver software, so read P0_INSEXEC instead.)


P0_LOAD | ICE9A | Perf 0 Load/pref/sync/cache ops.
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shoes | —SC*dICRQA [Reserved SSCSCSCSCSCSCSCSCSCSCSCSCSC~*d 


(Below events E0-FF correspond to the Cpu’s internal performance
counter 1 events, though with different numbering.
They only count in user, supervisor, or kernel mode,
as programmed in the R_CpuPerfCount[1]/CP0
Reg25 register.)


8’he0(-8’hff) | C1_ | ICE9B+ | Core Perf 1 ENUM:CpuScbCoreEvent. See above note.
See the CpuScbCoreEvent enumeration; it is inserted here 
to avoid duplication in this table. 


P1_CYCLES | ICE9A | Perf 1 Cycles.
P1_INSEXEC_B0 | ICE9A | Perf 1 Instructions executed, bit 0.


8’he2 P1_INSEXEC_B1 | ICE9A | Perf 1 Instructions executed, bit 1. Multiply by 2 and add
bit 0 counter for total number of instructions. (Scaled in
driver software, so read P1_INSEXEC instead.)


[Shes | PLLOAD TCR 
Hera 


8’he7 P1_FLOAT_B1 | ICE9A | Perf 1 Floating point instructions executed, bit 1. Multiply
by 2 and add bit 0 counter for total number of instructions.
(Scaled in driver software, so read P1_FLOAT
instead.)


Pehewhe | ——SC~*dICDA [Reserved SSCS 
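
The _B0/_B1 split described in the note inside this table can be recombined in software. The following is a minimal sketch, assuming both SCB counters were sampled over the same interval; the function name combine_scb_counts is hypothetical and is not taken from any SiCortex driver.

```python
def combine_scb_counts(b0_count: int, b1_count: int) -> int:
    """Approximate a CPU-internal event count from the two SCB split counters.

    b0_count: samples of the *_B0 event (weight 1).
    b1_count: samples of the *_B1 event (weight 2).
    """
    if b0_count < 0 or b1_count < 0:
        raise ValueError("counter values must be non-negative")
    # The table's rule: multiply the bit 1 counter by 2 and add the bit 0 counter.
    return b1_count * 2 + b0_count


if __name__ == "__main__":
    # A cycle in which the CPU counter stepped by 3 bumps both B0 and B1 once.
    print(combine_scb_counts(b0_count=1, b1_count=1))
```

As the note says, the result should only be similar to the CPU internal count, so treat it as an estimate.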


6.24.8 CpuConfig Register 


Class 


R_CpuConfig 
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}31 {|M [RR [1 |[__[ Indicates that the Configl register is implemented. 
30:28 | RW | X | Kseg2 and kseg3 cache coherency algorithm.


27:25 | KU | RW | X | Useg/kuseg cache coherency algorithm.


24:22 | LLTIME | RW | Lock timer interval. 000=8 cycles, 001=16 cycles,
in powers of 2 up to 111=1024 cycles. (SiCortex
Change.)


21 | SB | R | Simple BE bus mode enabled.
20 | RW | 0 | Instruction Scheduling Disable.


19 | WC | Unknown. Not documented, but implemented as RW bit.

RW | Issue Disable.
BM | R | X | Burst Mode.
15 | BE | R | X | Big endian byte ordering convention.
14:13 | AT | R | 2 | Architecture Type.
12:10 | AR | R | 0 | Architecture Revision.
9:7 | MT | R | MMU Type.
2:0 | K0 | RW | 2 | Specifies the Kseg0 cache coherency algorithm.
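
The LLTIME encoding above (000=8 cycles, doubling up to 111=1024 cycles) amounts to a left shift. Here is a small sketch, assuming the field sits at bits 24:22 of the CpuConfig value as listed; the helper names are hypothetical.

```python
LLTIME_SHIFT = 22   # LLTIME occupies bits 24:22 of CpuConfig
LLTIME_MASK = 0x7


def lltime_cycles(lltime: int) -> int:
    """Convert a 3-bit LLTIME code into a lock timer interval in cycles."""
    if not 0 <= lltime <= LLTIME_MASK:
        raise ValueError("LLTIME is a 3-bit field")
    return 8 << lltime  # 000 -> 8, 001 -> 16, ..., 111 -> 1024


def lltime_from_config(config: int) -> int:
    """Extract LLTIME from a raw CpuConfig value and return the interval."""
    return lltime_cycles((config >> LLTIME_SHIFT) & LLTIME_MASK)
```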


6.24.9 CpuConfig1 Register
Class
R_CpuConfig1


30:25 | MMUSizeM1 | R | X | Number of entries in the TLB minus one.
24:22 | IS | R | X | I-cache sets per way.
21:19 | IL | R | X | I-cache line size.
18:16 | IA | R | X | I-cache set associativity.
15:13 | DS | R | X | D-cache sets per way.
12:10 | DL | R | X | D-cache line size.
9:7 | DA | R | X | D-cache set associativity.
6 | C2 | R | X | Coprocessor 2 implemented.

Performance Counter.


6.24.10 CpuConfig2 Register 


Class 
R_CpuConfig2 
Reset | Product 


iwodat 
, 

R 
25:20 
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6.24.11 CpuFCCR Register 
Class 
R_CpuFCCR 


7:0 | FCC | RW | X | Floating-point condition code.


6.24.12 CpuWatchLo Register 
Class 


R_CpuWatchLo 
2 | I | RW | 0 | Watch exceptions are enabled for instruction fetches.
1 | R | RW | 0 | Watch exceptions are enabled for loads.
0 | W | RW | 0 | Watch exceptions are enabled for stores.


6.24.13 CpuWatchHi Register 
Class 


R_CpuWatchHi 


31 | M | R | Only one pair of WatchHi/WatchLo registers is implemented.


30 | G | Global match.
23:16 | ASID | RW | X | ASID.


11:3 | Mask | Bit mask that qualifies the address in the WatchLo register.


6.24.14 CpuFEXR Register 
Class 


R_CpuFEXR 


17:12 | Cause | RW | X | Cause bits.


6:2 | Flags | RW | X | Flag bits.


6.24.15 CpuXContext Register 
Class 


R_CpuXContext 
63:33 | PTEBase | RW | X | Page Table Entry Base.
32:31 | R | R | X | Region.


30:4 | BadVPN2 | R | X | BadVAddr register.


6.24.16 CpuDebug Register 
Class 


R_CpuDebug 
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SAPP a 

pow Ro eng Mode SSOSOSOSCSCSCSCS 
Sot a <7 
pas [Isnt [RW [0 |__| Toad Store Normal Memory. SSS 


27 | Doze | Processor was in low-power mode when a debug exception
occurred.
26 | Halt | Internal system bus clock was stopped when the debug
exception occurred.


25 | CountDM | R | 1 | Count Debug Mode.
23 | MCheckP | RW | 0 | Machine Check Exception Pending.
22 | CacheEP | RW | 0 | Cache Error Exception Pending.
21 | DBusEP | RW | 0 | Data Bus Error Exception Pending.
DDBImpr | R | X | Debug Data Break Imprecise.
17:15 | EJTAGver | R | 2 | Version.
14:10 | DExcCode | R | X | Debug Exception Code.
RW | Debug Single Step.
Debug Interrupt.
Debug Instruction Break.
Debug Data Break Store.
Debug Data Break Load.
Debug Breakpoint.
Debug Single Step.


R 


R 


6.24.17 CpuDEPC Register 
Class 


OFS 


R 
R 
el 
res = 


R_CpuDEPC 
Definition
DEPC Debug Exception Program Counter. 


6.24.18 CpuPerfCnt Register 
Class 


R_CpuPerfCnt 


M | R | 1 | Another pair of Performance Control and Counter registers
is implemented.

Wide | ICE9B+ | Wide counters. Always 0 to indicate the counters are 32
bits wide. This bit is part of the MIPS Release 2 architecture.


10:5 | Event6 | Counter event enabled for this counter. See 6.24.5. Overlaps
Event.


Event | Counter event enabled for this counter. See 6.24.5. Overlaps
Event6.


Counter interrupt enable. Because interrupts are level 
sensitive, clearing the enable near the time when the count 
will overflow may cause an interrupt that will disappear 
before the software services the interrupt. Generally soft- 
ware will ignore such interrupts. 


=< —— ee a $$} 
PoP ext___ Rw ont when EXECS 


May 14, 2014 329 Rev 51328 


SiCortex Confidential CHAPTER 6. PROCESSOR SEGMENTS 
6.24.19 CpuPerfVPC Register 
Register 22, select 0 for event 0. Register 22, select 1 for event 1. 


Class 
R_CpuPerfVPC 


63:62 | VPCH | R | X | High bits of PC.


39:2 | VPCL | R | X | Event 0/1 Virtual Program Counter. For the last
event 0/1 during SCB counting, the current virtual
PC.


Reserved.
6.24.20 CpuPerfPEA Register 


Register 22, select 2 for event 0. Register 22, select 3 for event 1. 


Class 
R_CpuPerfPEA 


63 | L2HIT | R | X | Last L2 hit. L2 cache indicated hit for the last
L1 miss, during SCB counting. Often wrong, see
bug2674.

62:60 | L2STATE | R | X | Last L2 cache state. L2 cache state the last L1 miss
came from, during SCB counting. Often wrong, see
bug2674.

59:56 | L2STOP | R | X | Last Bus stop. Bus stop number the last L1 miss
was serviced by, during SCB counting. See
CswStopNum. Often wrong, see bug2674.

55:48 | ASID | R | X | Event 0/1 ASID. For the last event 0 during SCB

35:5 | PEA | R | X | Event 0/1 Physical Effective Address. For the last
event 0/1 during SCB counting, the current physical
effective address of the last D-Cache hit or
miss. Note that this might not be the miss address, as
a DC hit-under-miss following the miss will report the
address of the DC hit.


4:0 | Reserved.
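
To make the CpuPerfPEA layout above concrete, here is a hedged decoding sketch. It assumes the PEA field holds physical address bits 35:5 in place (so masking recovers a 32-byte-aligned address); the PerfPEA type and decode_perf_pea name are hypothetical.

```python
from typing import NamedTuple


class PerfPEA(NamedTuple):
    l2hit: bool    # bit 63 (documented as often wrong, see bug2674)
    l2state: int   # bits 62:60
    l2stop: int    # bits 59:56
    asid: int      # bits 55:48
    pea: int       # bits 35:5, returned as an aligned physical address


PEA_MASK = ((1 << 36) - 1) & ~0x1F  # keep bits 35:5, clear bits 4:0


def decode_perf_pea(reg: int) -> PerfPEA:
    """Split a raw 64-bit CpuPerfPEA value into its documented fields."""
    return PerfPEA(
        l2hit=bool((reg >> 63) & 0x1),
        l2state=(reg >> 60) & 0x7,
        l2stop=(reg >> 56) & 0xF,
        asid=(reg >> 48) & 0xFF,
        pea=reg & PEA_MASK,
    )
```

Because of the hit-under-miss caveat in the table, the decoded address should not be assumed to be the miss address.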


6.24.21 CpuFENR Register 
Class 
R_CpuFENR 


PTE Pe PR tg 


DRS RW sh to ere 
a a a 


6.24.22 CpuErrCtl Register 
Class 

R_CpuErrCtl 

Attributes 
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-kernel 


31 | CorEna | RW | Parity/ECC correction enable. SiCortex: Set to
enable correction of ECC errors. Note ICE9A contains
bug1965: reads of this bit are seen in bit 28.

Parity Overwrite. SiCortex undefined behavior,

RW | Way Selection Test. SiCortex undefined behavior,


28 | RW | Enable Parity/ECC reporting. (SiCortex addition)
This bit is automatically cleared by HW before
invoking the Cache error trap handler. The
OS Cache Error Trap Handler needs to reenable
reporting before returning. Note ICE9A contains
bug1965: reads of this bit are seen in bit 31.


Flip bit 1 in all ECC generation trees, for diagnostic
ECC error generation. (SiCortex addition)


Flip bit 0 in all ECC generation trees, for diagnostic
ECC error generation. (SiCortex addition)
25:8 | Reserved.
7:0 | Parity bits read or written to a cache data RAM.
SiCortex undefined behavior, must be zero.
6.24.23  CpuCacheErr Register 
Class 


R_CpuCacheErr 


p29 EDR Xo Data. Single or doubley)——SOSOS—SCSCS 
ed a el oe BY 


25 | __—*| Additional data cache error, data cache | Additional data cache error. 


R Error Fatal. SiCortex: Only set for double bit er- 
rors. 
R 


Error Way. SiCortex: Often incorrect for D-Cache, 
bug1575. 


me | Wag Wag 
15:0 Index R Index. SiCortex: Often incorrect for D-Cache 
probes, bug1575. 


6.24.24 CpuTagLo Register 
Class 


22 


R_CpuTagLo 


PTagLo }RW [|X |__| Specifies the upper address bits for the cache tag. 
76 | PState [RW |X [Valid dirty Tne 


15 [Lb ~=~—6T RW. Xs State of the lock bit for the cache line. 
po PRK ity for the cache tag 


6.24.25 CpuDataLo Register 
Class 


R_CpuDataLo 
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Data read from the data array of the cache. 


6.24.26 CpuDataHi Register 


Class 


R_CpuDataHi 
31:0 | Data | RW | X | High-order data read from the cache data array.


6.24.27 CpuErrorEPC Register 
Class 


R_CpuErrorEPC 


Definition
ErrorEPC Error Exception Program Counter. 


6.24.28 CpuDESAVE Register 
Class 


R_CpuDESAVE 


Definition 
DESAVE Simple Read/Write register. 


6.24.29 CpuDCR Register 
Class 


R_CpuDCR 


29 | ENM | R | X | Endianness in which the processor is running in Kernel and
Debug Modes.


17 | DataBrk | R | X | Data hardware breakpoint is implemented.
16 | InstBrk | R | X | Instruction hardware breakpoint is implemented.


Hardware and software interrupt enable for Non-Debug
Mode.


NMIE | Non-Maskable Interrupt (NMI) enabled for Non-Debug
Mode.


NMIpend | Indicates pending NMI.
SRstE | RW | 1 | Soft reset is fully enabled.
0 | ProbEn | R | X | Probe services accesses to dmseg. Reads as zero.


6.24.30 CpuFCSR Register 
Class 


R_CpuFCSR 




31:25, 23 | FCC | RW | X | Floating point condition codes.
24 | FS | RW | X | Flush to zero.
22 | FO | RW | X | Flush Override.
21 | FN | RW | X | Flush to Nearest.


17:12 | Cause | RW | X | Cause bits.
11:7 | Enables | RW | X | Enable bits.
6:2 | Flags | RW | X | Flag bits.
1:0 | RM | RW | X | Rounding mode.


6.24.31 CpuIBS Register
Class


R_CpuIBS

30 | ASIDsup | R | 1 | ASID compare is supported in instruction breakpoints.
27:24 | BCN | R | Number of instruction breakpoints implemented.
3:0 | BS30 | RW | Break status.
3 | BS3 | RW | Break status. Overlaps BS30.
2 | BS2 | RW | Break status. Overlaps BS30.
1 | BS1 | RW | Break status. Overlaps BS30.
0 | BS0 | RW | Break status. Overlaps BS30.


6.24.32 CpuIBA Register
Class 


R_CpuIBA 


Instruction breakpoint address for condition. 


6.24.33 CpuIBM Register
Class


R_CpuIBM


IBM | R/W | Instruction breakpoint address mask for condition.


6.24.34 CpuIBASID Register 
Class 


R_CpuIBASID
ASID | RW | X | Instruction breakpoint ASID value for compare.


6.24.35 CpuIBC Register
Class
R_CpuIBC


ASIDuse | RW | X | Use ASID value in compare for instruction breakpoint.
2 | TE | RW | 0 | Use instruction breakpoint n as triggerpoint.


0 | BE | RW | 0 | Use instruction breakpoint n as breakpoint.
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6.24.36 CpuDBS Register 
Class 


R_CpuDBS
ASIDsup | R | 1 | ASID compare is supported in data breakpoints.


NoSVmatch | Value compare on a store is supported in data breakpoints.


NoLVmatch | R | 0 | Value compare on a load is supported in data breakpoints.
27:24 | BCN | R | 2 | Number of data breakpoints implemented.
1:0 | BS10 | RW | X | Number of BS bits implemented corresponds to the number
of breakpoints indicated.
6.24.37 CpuDBA Register 


Class 
R_CpuDBA 


Definition
Data breakpoint address for condition. 


6.24.38 CpuDBM Register 
Class 


R_CpuDBM 
Definition 
DBM Data breakpoint comparison mask. 


6.24.39 CpuDBASEID Register 
Class 


R_CpuDBASEID 
ASID | RW | X | Data breakpoint ASID value for compare.


6.24.40 CpuDBC Register 
Class 
R_CpuDBC 


23 | ASIDuse | RW | X | Use ASID value in compare.
21:14 | BAI | RW | X | Byte access ignore.


13 | NoSB | RW | X | Condition can be fulfilled on store access.
12 | NoLB | RW | X | Condition can be fulfilled on load access.


11:4 | BLM | RW | X | Compare corresponding byte lane.
2 | TE | RW | 0 | Use data breakpoint as triggerpoint.


6.24.41 CpuDBV Register 
Class 
R_CpuDBV 


Data breakpoint data value for condition. 
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6.24.42 Cpulndex Register 
Class 


R_CpuIndex


31 | P | R | X | Probe failure.


5:0 | Index | RW | X | Index to the TLB entry used by the TLB read.


6.24.43 CpuRandom Register 
Class 


R_CpuRandom 


5:0 | Random | R | X | TLB Random index.


6.24.44 CpuEntryLo Register 


Class 


R_CpuEntryLo 


[Paso hancNmba SSCS 

[| Coherency attribute ofthe page 
A 
Ri 
poe Rw ba i 


6.24.45 CpuContext Register 
Class 


R_CpuContext 


Reset | Type 
G53 | PTEBase 


BadVPN2. | RW |X |_| Virtual address updated on exceptions. 


6.24.46 CpuPageMask Register 


Class 


R_CpuPageMask 


24:13 | Mask | RW | X | Mask indicating which bits of VA must match.


6.24.47 CpuWired Register 
Class 


R_CpuWired 


5:0 | Wired | RW | 0 | TLB wired boundary.
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6.24.48 CpuBadVAddr Register 
Class 
R_CpuBadVAddr 


BadVAddr Virtual address that caused an exception. 


6.24.49 CpuFIR Register 
Class 


R_CpuFIR


19 | 3D | R | 0 | MIPS-3D ASE implemented.
18 | PS | R | 0 | Paired-single floating-point implemented.
17 | D | R | 1 | Double-precision floating-point implemented.
16 | S | R | 1 | Single-precision floating-point implemented.
15:8 | ProcessorID | R | 0x81 | Floating-point processor type.
7:0 | Revision | R | X | Matches CP0 PRId register.


6.24.50 CpuCount Register 
Class 
R_CpuCount 


31:0 | Count RW xX Interval counter. Counts every other pclk. Zeroed, then 
starts counting after reset. This allows all CPUs 
to have the same zero start time for exact cycle 
release-from-barrier. 
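
Since Count is a 32-bit register that advances every other pclk, an elapsed interval has to be scaled by two and must tolerate wraparound. The following is a rough sketch under those two stated properties; the function name elapsed_pclks is hypothetical, and it assumes the counter wrapped at most once between samples.

```python
COUNT_MASK = 0xFFFF_FFFF  # Count is bits 31:0
PCLKS_PER_COUNT = 2       # "Counts every other pclk"


def elapsed_pclks(start_count: int, end_count: int) -> int:
    """Return pclk cycles between two Count samples, allowing one wraparound."""
    ticks = (end_count - start_count) & COUNT_MASK
    return ticks * PCLKS_PER_COUNT
```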


6.24.51 CpuEntryHi Register 


Class 
R_CpuEntryHi 


63:62 | R | RW | X | Virtual memory region, corresponding to VA63:62.
61:40 | Fill bits.


39:13 | VPN2 | RW | X | VA39:13 of the virtual address.
7:0 | ASID | RW | X | Address Space Identifier.


6.24.52 CpuCompare Register 
Class 


R_CpuCompare 


31:0 | Compare | RW | X | Interval count compare value.


6.24.53 CpuStatus Register 
Class 
R_CpuStatus 




31 | CU3 | RW | X | Coprocessor Usable.
30 | CU2 | RW | X | Coprocessor Usable.
29 | CU1 | RW | X | Coprocessor Usable.
28 | CU0 | RW | X | Coprocessor Usable.
27 | RP | RW | X | Reduced power.
26 | FR | RW | X | Floating-point register mode.
25 | RE | RW | X | Reverse Endian.
24 | MX | RW | Enable access to MDMX resources on processor.
23 | PX | RW | X |
19 | NMI | RW | X |
15:8 | IM | RW | X |
6 | SX | RW | X |
5 | UX | RW | X |
4:3 | KSU |
2 | ERL | RW | X |
1 | EXL | RW | X |
0 | IE | RW | X | Interrupt Enable.
6.24.54 CpuCause Register 
Class 
R_CpuCause 


31 | BD | R | X | Branch Delay.
29:28 | CE | R | X | Coprocessor Exception.

23 | IV | RW | X | Interrupt Vector.
22 | WP | RW | X | Watch Postponed.
15 | IP7 | R | X | Interrupt Pending.
14 | IP6 | R | X | Interrupt Pending.
13 | IP5 | R | X | Interrupt Pending.
12 | IP4 | R | X | Interrupt Pending.
11 | IP3 | R | X | Interrupt Pending.
10 | IP2 | R | X | Interrupt Pending.
9 | IP1 | RW | X | Interrupt Pending.
8 | IP0 | RW | X | Interrupt Pending.
6:2 | ExcCode | R | X | Exception Code.

6.24.55 CpuEPC Register 


Class 


R_CpuEPC 

Exception Program Counter 
6.24.56 CpuPRId Register 
Class 

R_CpuPRId 
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Attributes 


-kernel 


Definition 


31:24 | CompanyOptions | Available to the CPU core user for company-dependent
options. Overlaps Allowed.


31 | OneCpu | Single core mode. Set in simulation model only.


28:24 | CoreNum | Core number (0-5) on the chip. (SiCortex enhancement.)

23:16 | CompanyID | R | 14 | Company that designed or manufactured processor.
1=MIPS, 14=SiCortex. (SiCortex change.)
ProcessorID | AddrProduct | Type of processor. Returns ICE9, ICE9B, etc.


7:0 | Revision | R | 1 | Revisions of the same processor type.


Note Revision not incremented between ORCS and ICE9A1. To determine ICE9A vs ICE9A1 read Rev field 
of SCB register R_ScbChipRev. 
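
As an illustration of the CpuPRId layout, the sketch below splits a raw 32-bit value into the fields listed above. The bit range of ProcessorID is not legible in this table; 15:8 is assumed here from the standard MIPS PRId layout. The function name decode_prid and the sample value are hypothetical.

```python
COMPANY_NAMES = {1: "MIPS", 14: "SiCortex"}  # values given in the table


def decode_prid(prid: int) -> dict:
    """Split a raw 32-bit CpuPRId value into its fields."""
    company_id = (prid >> 16) & 0xFF
    return {
        "one_cpu": bool((prid >> 31) & 0x1),      # bit 31, simulation only
        "core_num": (prid >> 24) & 0x1F,          # bits 28:24, cores 0-5
        "company_id": company_id,                 # bits 23:16
        "company": COMPANY_NAMES.get(company_id, "unknown"),
        "processor_id": (prid >> 8) & 0xFF,       # ASSUMED bits 15:8
        "revision": prid & 0xFF,                  # bits 7:0
    }
```

Per the note above, Revision alone does not distinguish ICE9A from ICE9A1; that requires the Rev field of R_ScbChipRev.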


6.24.57 Ecc Injection Magic Register 


The cache ECC Magic registers are used to generate L1 ECC errors. This is implemented only in the verification 
model, for testing purposes. 


Register 
R_CpuxEccInjMagic 


Attributes 


-noregtest -noregdump 


Address 


0x00_0400 (plus base address) 
Go | W | 0 | When written one, toggle bit as specified.
30 | ICache | W | 0 | Write ICache, else if zero write DCache.


29 | Tag | W | 0 | Write Tag RAM, else if zero write data RAM.


33:16 | Bitnum | Bit number in physical RAM to toggle. Includes
data, ECC, and parity bits, where enumeration depends
on internal RAM organization.


Index | W | 0 | Cache index to write.


6.25 EJTAG Registers and Definitions 
6.25.1 EJTAG TAP Instructions 


Enum 


CpuTapInstr 


Definition
IDCODE Selects Device id 


IMPCODE Selects Implementation register 
ADDRESS Selects Address register 




ShA 
THB 
Zhe 


5’hD NORMALBOOT | Disables debug exception after reset
5’hE FASTDATA | Selects the Data and Fastdata register
5’h1F BYPASS | Selects the Bypass register


6.25.2 CpuTapIDCODE Register 


Class 
R_CpuTapIDCODE 
Attributes 
-tapSize=32 
Definition 


31:28 | Version | R | X | Identifies the version of a specific device. In ICE9A
returns one. In ICE9B and follow-ons returns processor
number.
27:12 | Part | R | X | AddrProduct | Identifies the part number of a specific
device. In ICE9A contains value of ICE9_CPU0 through
ICE9_CPU5 as appropriate. Later passes contain
ICE9*_CPU.


11:1 | ManufID | R | SICORTEX | AddrTapMfgr | Identifies the manufacturer identity code of a
specific device.
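
The IDCODE fields above can be unpacked with plain shifts and masks. This is a minimal sketch; the function name decode_idcode is hypothetical, and treating bit 0 as the fixed marker bit follows the usual JTAG IDCODE convention rather than anything stated in this table.

```python
def decode_idcode(idcode: int) -> dict:
    """Split a 32-bit CpuTapIDCODE value into Version, Part and ManufID."""
    return {
        "version": (idcode >> 28) & 0xF,     # bits 31:28
        "part": (idcode >> 12) & 0xFFFF,     # bits 27:12
        "manuf_id": (idcode >> 1) & 0x7FF,   # bits 11:1
        "marker": idcode & 0x1,              # bit 0 (assumed JTAG marker bit)
    }
```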


6.25.3 CpuTapIMPCODE Register 
Class 


R_CpuTapIMPCODE 


Attributes 


-tapSize=32 
31:29 | EJTAGver | R | X | EJTAG version implemented.


DINTsup | R | X | Support for the DINT signal from the probe.
22:21 | ASIDsize | R | 2 | Size of the ASID field.


16 | MIPS16 | R | 0 | MIPS16 ASE is supported.
14 | NoDMA | R | 1 | Indicates no EJTAG DMA support.
po paapsea [Re bit processor SCS 


6.25.4 CpuTapDATA Register 
Class 
R_CpuTapDATA 


Attributes 


-tapSize=64 


uint64_t | Data used by processor access. 
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6.25.5 CpuTapADDRESS Register 
Class 
R_CpuTapADDRESS 


Attributes 


-tapSize=36 


Address used by processor access.


6.25.6 CpuTapECR Register 
Class 
R_CpuTapECR 


Attributes 


-tapSize=32 


31 | Rocc | RW | 1 | Soft reset has occurred since the bit was last cleared.
30:29 | Psz | RW | X | Size of pending access. 0=byte, 1=HW, 2=W, 3=DW.
22 | Doze | R | 0 | Processor in low-power mode.
21 | Halt | R | Internal clock is running.
20 | PerRst | RW | 0 | Peripheral reset.
19 | PRnW | R | X | Read not write processor access.
18 | PrAcc | RW | 0 | Pending processor access.
16 | PrRst | RW | 0 | Apply processor reset.
15 | ProbEn | RW | X | Probes will be serviced by EJTAG.
14 | ProbTrap | RW | X | Relocates debug exception vector.
12 | EjtagBrk | RW | X | Requests debug exception.
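
A probe typically needs the pending access size from the ECR before servicing a processor access. Below is a small sketch of decoding the Psz encoding given above (0=byte, 1=HW, 2=W, 3=DW). It assumes Psz sits at bits 30:29 and PrAcc at bit 18, as in the standard EJTAG control register; the names are hypothetical.

```python
# Psz code -> access size in bytes (byte, halfword, word, doubleword)
PSZ_BYTES = {0: 1, 1: 2, 2: 4, 3: 8}


def pending_access(ecr: int) -> dict:
    """Report whether a processor access is pending and its size in bytes."""
    psz = (ecr >> 29) & 0x3          # ASSUMED position: bits 30:29
    pending = bool((ecr >> 18) & 1)  # ASSUMED position: bit 18 (PrAcc)
    return {"pending": pending, "size_bytes": PSZ_BYTES[psz]}
```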


6.25.7 CpuTapFASTDATA Register 
Class 
R_CpuTapFASTDATA 


Attributes 


-tapSize=1 


0 | SPrAcc | RW | X | Zero if processor action completed. (See documentation.)


6.26 Cpu Implementation-Only Definitions 


6.26.1 Request Commands 


These encodings are used for cpu_cac_reqCmd_pr. 


Enum 


CpuReqCmd 


3’b000 | TWC9A+ | Reserved. (If we remove Valid, this becomes the idle.)
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-sborr [INV] 


(else) | TWC9A+ | Reserved.


6.27 Cac Registers and Definitions 
6.27.1 Probe Queue Handler States 


This is the encoding for the probe queue handler state machine in the CAC portion of the processor segment. 


Enum 


CacPrbQState 


Wait for writeback to complete 


L1PRBDN_WT | Wait for L1 probes to complete.
5’h13 CHK_NHACK | Wait for NOHIT to complete in CMX


6.27.2 Processor Interface Ready State Machine 


Enum 


CacRdyState 


IDLE Wait for the next request or a pause 
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PAUSED Pausing to honor BIU pause request from CTL or DAT 
unit 


4’h7 PREPAUSE | We’re about to pause, but we should check first to allow
one last read to sneak in, if necessary.

4’hE PREDOP1 | We’d like to send out a pending op, but we need to wait
two tics.

4’hF PREDOP2 | We’d like to send out a pending op, but we need to wait
one more tic.


6.27.3 L2 Cache Pause During Fill State Machine


Enum 


CacDpseState 


IDLE Wait for a new data block to arrive 
2’h1 WT0 | Wait for either BIUPaused or the last stage of FillIdx
WT4PSED | Wait for BIUPaused to be asserted
WT4FIDX | Wait for the last stage of FillIdx
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Chapter 7 


L2 Cache Coherence and Switch 


by Jud Leonard and Matt Reilly. 
[$Id: L2Cache.lyx 49898 2008-01-22 14:26:37Z zeno $] 


7.1 Summary 


The ICE9 node chip implements a 1.5 MByte L2 mixed instruction and data cache that is accessible from all 
six CPU cores, PCI-Express, and the DMA engine. The L2 cache is split into six segments, each closely connected 
with a single processor. Each L2 cache segment is 2-way set associative with a 64 byte line size, with writeback 
policy and allocation on read or write miss. It acts as a proper superset of the L1 data caches in the cores, and 
maintains coherence among them by enforcing exclusive ownership of writable blocks. The L2 supports coherent 
shared access among the cores without reference to main memory. 

This section describes the Central Cache Switch (CSW) and the protocol that manages cache coherence and 
data movement among the processors and I/O devices on the ICE9 node. The first sections of this chapter give a 
general outline of the approach and present some notes on how we got here. The latter sections (beginning with 
Section 7.10) present detailed descriptions of transaction flows and responses. 

For a more detailed outline of the Processor to L2 organization, see Chapter 6. For an explanation of the DMA 
interface to the L2 and CSW, see Chapter 5. For more information on the PCI Express controller and other I/O 
devices, see Chapters 15, 13, 14, and 10.


7.2 Differences, Bugs, and Enhancements 


7.2.1 Product and Chip Pass Differences 


1. TWC9A’s L2 cache is part of the new IceT core, and is described in a different document.


2. TWC9A adds the CswStopNumTwc and CswTidTwc enumerations to support more cores, and more TIDs
per core, bug3377.


3. NEED IMPL: TWC9A fixes the R_CacxIntCr[#]_Overflow bit being mis-cleared when clearing R_CacxIntCr[#]_Active,
bug3165.


4. NEED IMPL: The R_CohxEccMode_CorEna bit must be set whenever the ICE9 caches are active, bug1990.
5. NEED IMPL: TWC9A pushes IO writes instead of using a special command, bug4898.


6. NEED IMPL: TWC9A removes SPCL in favor of IO writes, bug4899.


7. NEED IMPL: TWC9A stalls issuing probes to avoid large per-cpu probe queues.
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7.2.2. Known Bugs and Possible Enhancements 


7.3 L2 Cache Features 


The L2 cache stores 1.5 MB of data. It is structured as six 256 KB cache segments to provide sufficient 
bandwidth for 6 cores, and to minimize the typical access latency. Each segment is 2-way set associative. The 
cache is interfaced to two DDR2 SDRAM memory controllers, interleaved on the cache line size, 64 bytes. 


• Line size = 64 Bytes (2^6) plus ECC on 8-byte doublewords

• Number of tags = 24K (3 × 2^13) total

• Associativity = 6 Segments, 2-way associative

• Index size = 11 bits {(address <26:17> xor <16:7>), address <6>}
• Tag, state, and all data are ECC protected

• Replacement = LRU nearest requestor

• Physical Address = 36 bits

• Protocols = Snooping, Writeback, Subset
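
The geometry and index function in the list above can be checked with a few lines of arithmetic. This is a sketch only: it assumes the braces denote Verilog-style concatenation, with the XOR of address bits 26:17 and 16:7 forming the upper ten index bits and address bit 6 the lowest bit. All names are hypothetical.

```python
SEGMENTS = 6
SEGMENT_BYTES = 256 * 1024
LINE_BYTES = 64
WAYS = 2

SETS_PER_SEGMENT = SEGMENT_BYTES // (LINE_BYTES * WAYS)  # sets in one segment
INDEX_BITS = SETS_PER_SEGMENT.bit_length() - 1           # log2 of the set count
TOTAL_TAGS = SEGMENTS * SEGMENT_BYTES // LINE_BYTES      # one tag per line


def l2_index(paddr: int) -> int:
    """Compute the 11-bit L2 set index for a 36-bit physical address."""
    upper = (paddr >> 17) & 0x3FF   # address<26:17>
    lower = (paddr >> 7) & 0x3FF    # address<16:7>
    bit6 = (paddr >> 6) & 0x1       # address<6>
    # ASSUMED ordering: {xor result, bit 6}, xor result in the high bits.
    return ((upper ^ lower) << 1) | bit6
```

With these constants the set count comes out at 2K per segment and the tag count at 24K, matching the figures quoted in this chapter.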


Every processor request is attempted first in the local L2 segment. If it misses, the request is directed to one of 
the coherence controllers (at the memory interface), as selected by bit 6 of the address. The request must arbitrate 
for use of the memory request/address bus toward the selected controller. The coherence controller looks for the 
requested address in a duplicate tag store (the master); it may match in one or more of the tags corresponding to 
other processors. In the event of a hit, the controller redirects the request to the hit segment, which will return the 
block to the requestor and, in the case of a data-stream fetch, transfer ownership to the recipient. 


7.3.1 Terminology 


Block The unit of memory identified by one tag in the L1 cache, consisting of 4 doublewords (32 bytes) with byte 
parity. Synonymous with half-line. 


Clean The state of a memory block which is known to be unchanged with respect to the value in memory. A clean 
block can safely be discarded. 


Dirty The state of a memory block which has been modified since it was read from memory. It must be written 
back to memory (victimized) before its space in the cache is reclaimed. Synonymous with Modified. 


Doubleword 8 bytes (64 bits). The standard size of data values in the 5Kf microprocessor, and the width of most 
data busses in the chip. 


Exclusive The state of a cache block which ensures that it belongs to exactly one L2 segment and possibly the 
associated L1. The processor is permitted to modify a block if and only if it is in the exclusive state. It is 
allowed that a block be in only one segment without exclusive state, but not allowed to have exclusive state 
when there is a copy in more than one segment. 


Line The unit of memory identified by one tag in the L2 cache. It consists of 8 doublewords (64 bytes) with ECC 
on each doubleword; equal to two blocks. 


Segment One of the six 256 KB partitions of the L2 cache, consisting of a 2-way set associative cache with 64-byte 
lines and 2K sets. Each segment stores lines that have been accessed by the processor with which it is paired; 
data in any segment can be used to satisfy a cache miss, and writes are kept coherent among segments. 


Shared The complement of exclusive state; a block in shared state is readable to any processor’s instruction cache, 
but must be transitioned to exclusive state before it can be accessed by the data cache (and therefore written). 
It is possible for a block to be in shared state while being in only one segment. 


Tag The auxiliary information stored with each line of a cache, indicating where that line belongs in main memory 
and its state with respect to memory. 


Updated The state of a cache block after it has been written by the currently owning processor. That is, a block 
enters into an L2 segment in the EXCLUSIVE or DIRTY state. If the block is then written by the associated 
processor, it enters the DIRTY and UPDATED state. (Updated or Dirty blocks must be written to memory 
when they are evicted.) The Updated state is left over from an earlier complex scheme for maintenance of 
the LoadLinked/StoreConditional state. See Sections 7.8.1 and 6.6.10. 


Figure 7.1: Address Partitions 

[Figure not reproduced. It shows four bit-field layouts: 
Virtual Address: a Region field in the high-order bits, a field that must be equal to bit 63, and the Virtual Page Number; page size is variable from 4 KB to 256 MB in 4x steps. 
Physical Address: Region, Cache Coherency Attribute, an ignored field, and the Physical Page Number. 
L2 Cache Address: an ignored field, the Address Tag, the Index {([26:17] xor [16:7]), [6]}, and the Block offset. 
DRAM Address (breakdown variable according to device): an ignored field, Row Select, Bank Select, Interface Select, and one further select field whose label is not legible.] 


7.3.2 Unusual Features 


For those familiar with other cache designs, this one holds few surprises. It can be understood as six processors 
with separate snooping L2 caches. The major difference is that snooping uses a central “coherence controller” which 
keeps the master tags and victim buffer. The coherence controller maintains an accurate representation of the 
contents of all the cache segments, and need not take cycles from the segments unless a state change is required. 

It is also unusual that this design does not support the “shared” state for data blocks (it does allow shared 
instruction blocks). The drive behind this decision comes from the fact that the MIPS L1 design does not provide 
a shared state separate from exclusive: the data cache will permit a write to any block it holds. We thought about 
redesigning the dcache controller, but at this point it doesn’t appear that the performance impact of shuttling 
blocks between segments is so severe as to justify the risk and design effort. 


7.3.3 Error Control 


The L2 cache data and tag arrays are protected by a single-error-correcting, double-error-detecting (SEC/DED) 
Error Correcting Code which requires 8 ECC bits for each 64-bit doubleword of data. The normal read access path 
allows time for detecting and correcting errors in the tag or data arrays. 

The cores expect parity on data blocks. L2 reads will correct and report single-bit errors, and present the 
corrected doubleword with valid parity. L2 writes will check parity as presented by the processor, and compute 
ECC. 
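
The figure of 8 ECC bits per 64-bit doubleword is what a conventional extended Hamming code requires. The sketch below assumes that construction (this section does not say which SEC/DED code ICE9 actually uses) and computes the check-bit count for a given data width; `secded_check_bits` is a hypothetical helper name.

```python
def secded_check_bits(data_bits):
    """Check bits for an extended Hamming SEC/DED code over data_bits bits."""
    # Single-error correction needs r bits with 2**r >= data_bits + r + 1,
    # so that every single-bit error position has a distinct syndrome.
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1

print(secded_check_bits(64))
```

For 64 data bits the loop stops at r = 7, and the overall parity bit brings the total to 8, giving a 72-bit stored word.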


7.4 Processor to L2 Cache Interface 


NOTE: This section is dated. See the processor chapter (Chapter 6) for the current interface description. 


7.5 Major Blocks and the General Approach 


The L2/CSW implements a split transaction MESI (Modified, Exclusive, Shared, Invalid) cache coherence 
protocol. Each node on the daisy-chained pair of buses is connected at a “bus stop” and may initiate requests 
via the chain to any other node. Memory accesses are all sequenced through one of two coherence controllers. 
(Each controller is responsible for one of the two DIMM slots.) A fill request (caused by an L2 miss) is sent to 
the appropriate coherence controller and checked against its shadow copy of each processor’s L2 tag array. If the 
required block is not found in any other processor’s segment, the request is satisfied by the associated DRAM 
controller. 

If the coherence widget finds a tag match in some processor’s L2 segment, the request will be forwarded to the 
appropriate processor and ownership will be transferred, if necessary. 

In addition to normal cache transactions, the CSW and L2 protocols support block read and write operations 
from I/O and fabric devices. That is, the DMA engine — for example — may write an entire 64 byte block to physical 
memory. If the block is currently cached by a processor segment, the DMA engine will transfer its data directly to 
the L2 cache. 

The following sections introduce the basic components and operations in the L2 CSW and Coherence widgets. 
More detailed information is presented in Section 7.10. 


7.5.1 Supported Operations 


Each processor may originate memory read and write transactions. Each memory transaction moves 64 bytes to 
and from a DRAM unit or another processor. A processor may have no more than 1 such transaction outstanding. 
L2 cache fills that may require victimization of a block will cause a processor segment to initiate a read-with- 
victimization operation (RDV or RDSV). Such operations count as one transaction, though the processor segment 
will write one block to memory and receive a second block for the fill. 

The DMA engine and the PCI Express controller may initiate block transfers of 32 or 64 bytes. Block read 
operations transfer data from DRAM if it is not cached, or are forwarded to the appropriate L2 cache segment. 
Block read transfers cause no change in ownership of the block — it stays in the owner’s cache. Block read operations 
are always 64 bytes long. Block write transfers may either send data to the DRAM or — if the block is cached — will 
overwrite the cached copy. Again, block write transfers cause no change in ownership of the block, and are atomic 
as far as a processor may observe. 

Any unit on the CSW may originate and may accept I/O read and write transactions. All I/O transfers are 8 
bytes long. 

Processor segments must accept interrupt delivery transactions from any other unit on the CSW. 

Any unit may accept special accelerated I/O write transfers, via the SPCL transaction. However, only the DMA 
engine supports SPCL, so SPCL to any other device is unsupported. (See Section 7.10.6.) 


7.5.2 Per-Processor Segment 


Figure 7.2 shows the structure of one segment of the L2 cache, standing between the Processor’s Bus Interface 
Unit and the Central Switch which connects all segments to the Coherence Controllers, and through them, to the 
memories. 


Figure 7.2: Segment Block Diagram (See Chapter 6.) 

[Figure not reproduced. The segment sits between the Processor BIU and the Central Switch (CSW) Interface; the surviving path labels are Req/Address, Victim, and Write Data.] 


7.5.3 Bidirectional spine structure 


Each processor communicates with memory and I/O through its associated L2 segment. The L2 caches, the 
DMA engine, and PCI-express interfaces share two busses, one to each of the coherence controllers. Processors 
use a 64-bit interface at 500 MHz, which is converted at the interface to 128 bits at 250 MHz in the L2 segment 
and on the Even and Odd-bound busses. (We don’t use the more obvious East and West directions for historical 
reasons. The Even bound bus chain carries data from each bus stop (connection point) to the Even bank of memory 
(address[6] = 0) on the east side of the die. The Odd bound bus carries data from each bus stop to the Odd bank 
(address[6] = 1) on the west side of the die.) 
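
The width and clock conversion at the interface preserves bandwidth, which a short calculation confirms. This is only a sketch of peak raw rates from the figures above; it ignores protocol overhead, and the variable names are illustrative.

```python
# Peak raw bandwidth on each side of the processor/L2 interface.
CORE_BITS, CORE_MHZ = 64, 500    # processor-side interface
CSW_BITS, CSW_MHZ = 128, 250     # L2 segment and Even/Odd-bound busses

core_bytes_per_s = (CORE_BITS // 8) * CORE_MHZ * 10**6
csw_bytes_per_s = (CSW_BITS // 8) * CSW_MHZ * 10**6

print(core_bytes_per_s, csw_bytes_per_s)
```

Both sides come to 4 × 10^9 bytes per second, so doubling the width while halving the clock neither adds nor loses peak throughput.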


Figure 7.3: Chip Floorplan 

[Figure not reproduced; the only surviving labels are the PHY and the processor cores.] 


A very rough floorplan is shown in Figure 7.3. The arrangement and order of units along the CSW may change 
as we refine the routing. 

The floorplan arrangement is chosen to put the major pin fields along edges of the chip: DDR memory interfaces 
(100+ pins each) on east and west sides, PCI-Express (~32 pins) on the north, and DMA Engine/Switch (~120 
pins) on the south. The data arrays are arranged to group in each array bits which will be read and written 
simultaneously, and to line up arrays so that common address and data wires are straight. CSW busses extend the 
width of the chip, to reach all RAM arrays they must touch, and to be accessible to the processor input/output 
ports arrayed horizontally across the die. The CSW buses provide the principal medium for memory sharing among 
processors. 


7.5.4 Tags 


Each line in the L2 cache is associated with a tag, which includes the high-order physical address bits identifying 
the cached memory block, plus dirty state and ECC bits. Each tag is stored twice: once (the “local” copy) in the 
cache segment close to the processor it primarily serves, and once (the “master”) in one of the coherence controllers 
associated with each memory interface (selected by address bit 6). The local segment also keeps track of the most 
recently used way of each set, for use in replacement decisions. 

The master tags are consulted when any reference misses in the local segment; if they show that the referenced 
block exists in another cache, the block is obtained from there rather than memory. A block may be exclusive (and 
therefore writable) in one segment, or shared (and therefore read-only) in several segments. To exclude the rare 
possibility that a line is dirty in several segments, we will victimize any dirty block when it is read for the i-cache. 


7.5.5 Hashed Index 


The L2 cache and tag arrays are addressed by physical address bits 16:7 XOR 26:17, concatenated with bit 6; the 
tag arrays store bits 34:17. Victim addresses are reconstructed by using bits 34:17 from the tag, and XOR’ing 
the hashed part of the array index with bits 26:17 of the tag to recover bits 16:7. Bit 6 is excluded from the hash 
because we must ensure that any block victimized from an L2 segment will be sent to the same coherence controller 
as the controller that will return the new fill data. (If bit 6 were included in the hash, we could evict an odd block 
from the L2 segment and replace it with an even block. The protocol described below just won’t work that way.) 
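
The hash and its inverse can be sketched in a few lines. The bit positions below are the ones given in this section; the function names are hypothetical, and the code models only the address arithmetic, not the hardware arrays.

```python
def bits(value, hi, lo):
    """Extract bits hi:lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def l2_index(paddr):
    """11-bit index: {(addr<26:17> xor addr<16:7>), addr<6>}."""
    hashed = bits(paddr, 26, 17) ^ bits(paddr, 16, 7)
    return (hashed << 1) | bits(paddr, 6, 6)

def l2_tag(paddr):
    """The tag arrays store address bits 34:17."""
    return bits(paddr, 34, 17)

def victim_line_address(tag, index):
    """Rebuild the 64-byte line address of a victim from its tag and index."""
    bit6 = index & 1
    hashed = index >> 1
    # The low ten tag bits are address bits 26:17; XOR undoes the hash.
    addr_16_7 = hashed ^ (tag & 0x3FF)
    return (tag << 17) | (addr_16_7 << 7) | (bit6 << 6)

line = (0x2ABCD << 17) | (0x155 << 7) | (1 << 6)
assert victim_line_address(l2_tag(line), l2_index(line)) == line
```

Because XOR is its own inverse, the tag needs no extra bits to make the victim address recoverable, and bit 6 passes through unchanged so the victim and the fill always belong to the same coherence controller.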


7.5.6 Outstanding Read CAM (ORC) and Write Back CAM (WBC) 


Every read operation in the coherence controller is checked against, and recorded in, the Outstanding Read 
CAM (ORC). The ORC ensures that no new read presented to the coherence controller is allowed to proceed if 
it conflicts with a read operation already in progress. Similarly, we record all write operations in progress in the 
WriteBack CAM (WBC). Both ensure that reads and writes to the same block of memory complete in order. 
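
The ordering role of the ORC can be illustrated with a small software model. This is a sketch under stated assumptions: the real structure is a hardware CAM, and the class name, method names, and TID strings here are invented for illustration. A read to a line that already has a read in flight is held back and released in arrival order.

```python
from collections import defaultdict, deque

class OutstandingReadCam:
    """Toy model: one read in flight per line, later reads queued in order."""

    def __init__(self):
        self.in_flight = {}                 # line address -> TID in progress
        self.waiting = defaultdict(deque)   # line address -> queued TIDs

    def present(self, line_addr, tid):
        """Return True if the read may proceed now, False if it must wait."""
        if line_addr in self.in_flight:
            self.waiting[line_addr].append(tid)
            return False
        self.in_flight[line_addr] = tid
        return True

    def complete(self, line_addr):
        """Retire the active read; return the TID released next, if any."""
        del self.in_flight[line_addr]
        if self.waiting[line_addr]:
            released = self.waiting[line_addr].popleft()
            self.in_flight[line_addr] = released
            return released
        return None

orc = OutstandingReadCam()
print(orc.present(0x1000, "ps0:0"))   # first read to the line proceeds
print(orc.present(0x1000, "ps1:0"))   # conflicting read is held
print(orc.complete(0x1000))           # completion releases the held read
```

A WBC model would look the same with writes in place of reads; the point is only that conflicts are resolved by queuing at the coherence controller rather than by buffering data elsewhere.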


7.5.7 Victim Buffer 


We don’t implement victim buffers. Since all operations are sequenced through the coherence widgets and the 
ORC/WBC units, we have no need of “temporary” data storage to cover the ships-passing-in-the-night problems. 


7.6 I/O and DMA Transactions 


I/O transactions are initiated by Load and Store instructions from the processors, where the physical address 
refers to I/O space (see Table 7.1). The L2 segment misses (because I/O space addresses are not cached), and the 
request is presented to the CSW with a target which selects the addressed device (processor, DMA engine, PCIe 
adapter, Memory Controller, etc.). Each L2 segment is permitted to have only one I/O request outstanding at a 
time; Read requests are completed by the return of read data. 

Write transactions are special. Imagine that a processor X initiates a read miss transaction to get a block of data 
from physical memory. Now imagine that processor Y attempts to write data into a control register on processor 
X. It is possible that the data for Y’s write could arrive at X’s bus stop at the same time as the read miss data. 
We'd have to buffer one of the items. In fact, we could imagine having to buffer several items. That’s expensive for 
an improbable circumstance caused by a low-frequency operation like an I/O write to a processor control register. 
To simplify the hardware, we require that all data arriving at a bus stop be “pulled” by the recipient. So, when 
processor Y wishes to write an I/O register in processor X, X will register the write request, and reflect a READ 
IO request back to processor Y. Processor Y will answer with the data that it wishes to write. (See Table 7.53.) 

DMA transactions are initiated by an I/O device connected through the DMA engine or PCIe adapter, and 
typically reference main memory, but the coherence controller checks each such reference against the master tags. 
In the event that a read matches, the request is completed by probing the owning cache segment without taking 
exclusive ownership. When a write matches, the DMA data overwrites the old contents of the cache segment, 
leaving it valid and modified. 


7.7 Coherence Interactions 


The data cache segments within each of the processor segments have five states for every block: invalid, shared, 
exclusive clean, exclusive dirty, and exclusive updated. The cores can change a line from clean or dirty to updated 
without informing other cache segments. 

The cores make only two kinds of requests to the L2 cache: reads and writes. Requests may be qualified in 
various ways; see Table 7.17.3. 


7.7.1 Races 


The master tags always change before the local tags, and tag changes are protected from conflict by the OTC. 
The OTC ensures that any new incoming request is queued while an earlier request for the same block is in progress. 
When a processor segment evicts a block B from its L2, it must set the block to INVALID in the L2 before issuing 
any victim write command (or a read command with an implied victim writeback) to the cache switch. 

Transfers must notify the coherence controller upon completion, so that any other requests queued for the same 
block can be cleared. Completion is identified by the Transaction ID code generated by the originator of the request, 
and is sent by the originator to the coherence controller, which knows the address and former owner of the block. 

Because of the sequencing and the dependence chain maintained in the coherence controllers, processor segments 
need not compare incoming addresses to the L2 writeback buffer. If a victim has been identified and a writeback 
command has been sent to the coherence controller, the PS must return a PROBENOHIT response to the requestor. 
The requestor will then retry the read command. Again, the dependence chain maintained in the coherence 
controllers (in the OTC and WBC) ensures that the retried read operation will succeed. 


7.7.2 Probes 


For most L2 cache accesses, we expect that the master tag will show that no cache had a copy of the requested 
block, so the block must be obtained from memory. There are, of course, a few exceptions, and for those cases the 
controller issues Probe requests to the cache segment whose tag matches. A probe request contains the physical 
address of the block in question and indicates to whom the data should be sent. SHARED blocks filling I-stream 
requests are left in the SHARED state in both requester and responder. Blocks filling D-stream requests will cause 
ownership to transfer. 

It is possible that a probe is on its way to a segment while the block it addresses is being victimized from the 
segment. In such cases, the responding segment returns PROBENOHIT and the original requester retries the read. 
The retry, through mechanisms in the coherence controller, is guaranteed to succeed. 

A probe response that involves writeback may take many cycles to complete, because it may be necessary to 
drain the write buffer in the 5kf processor. It is therefore possible to create a backlog of probe requests to a single 
processor. These are serviced in order of arrival in the L2 segment’s command queue. 


7.8 Multiprocessor Issues 


7.8.1 LL/SC 


LL/SC is handled entirely within the ICE9 modifications to the 5kf processor core. When an LL instruction 
access to the L1 completes, the processor will delay processing of all probe requests from the L2 cache for a 
programmable number of cycles. Any probes received for the LL target block after this delay will force the SC to 
fail. (For a more complete description of the LL/SC mechanism, see Section 6.6.10 in the processor chapter.) 


7.8.2 Lockstep cache thrashing 


Typical applications of the SC 1000 will have many copies of the same program running simultaneously; in some 
cases that will result in all the processors of a node accessing the same relative location on different pages nearly 
simultaneously. This would be likely to result in thrashing of the L2 cache if we didn’t do something to prevent it, 
so the cache index is hashed to distribute any page-relative location among many different index values. This does 
not require a larger tag; the address of a victim can be recalculated by the inverse hash function. 
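
The effect of the hash on lockstep access patterns can be seen with a small experiment. The layout below is hypothetical: six processors touch the same page-relative offset in buffers placed 128 KB apart. The unhashed index (address bits 16:6) is shown only for comparison; it is not a mode of the hardware.

```python
def plain_index(paddr):
    """Comparison only: an unhashed 11-bit index from address bits 16:6."""
    return (paddr >> 6) & 0x7FF

def hashed_index(paddr):
    """The hashed index: {(addr<26:17> xor addr<16:7>), addr<6>}."""
    hashed = ((paddr >> 17) & 0x3FF) ^ ((paddr >> 7) & 0x3FF)
    return (hashed << 1) | ((paddr >> 6) & 1)

STRIDE = 128 * 1024   # assumed spacing between per-processor buffers
OFFSET = 0x1040       # same relative location in every buffer

addresses = [cpu * STRIDE + OFFSET for cpu in range(6)]
plain_sets = {plain_index(a) for a in addresses}
hashed_sets = {hashed_index(a) for a in addresses}

print(len(plain_sets), "distinct set(s) without the hash")
print(len(hashed_sets), "distinct set(s) with the hash")
```

Without the hash all six addresses share one index value; with it each lands on a different index, because the differing upper address bits are folded into the index.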


7.8.3 Deadlock Freedom 


It is necessary to show that the system is always able to make progress; that requires that there can be no 
closed cycle of resource dependencies. 

An L1 D-cache read can be stalled in the read queue waiting for the ORC, which may report a conflict for the 
same cache line. The core cannot request another read while there is one outstanding. 

The ORC frees dependent transactions when main memory requests complete and when ownership transfers 
complete. 

Memory requests complete with the passage of time. Fills have first priority for use of CSW and L2 cycles. 

L2 cache writes (which do not assert transfer) depend only on availability of L2 segment cycles. 

Memory writes complete with passage of time; they have no dependencies. I/O writes to PCI space may depend 
on completion of memory reads or writes. 

To ensure that we can drain the write buffer, the processor will be granted a small number of write credits (just 
enough to keep the pipeline busy) until a probe matches something in the write buffer. At that time, the processor 
will inhibit instruction issue (as if a SYNC instruction had been found) until the external write buffer is empty, 
and the external write buffer will make available enough credits to drain the internal write buffer. The interface 
will separate I/O writes and cached memory writes into separate queues, and update the L2 immediately as the 
memory writes are issued. This will allow the probe to be satisfied despite delays in I/O service. 

And Wilson is terribly afraid that I’m going to forget to keep transfer requests separate from non-transfer 
requests; if a transfer request got stuck waiting for a non-transfer request, we could deadlock. 


7.9 L2 Segment to Memory Interface 


Each segment of the L2 cache includes a block of interface logic by which it communicates with the coherence 
controllers, the memory and I/O systems, and other segments. Figure 7.4 sketches the interface. The interface 
consists of two daisy-chain busses, called Evenbound and Oddbound. Each segment decides, whenever it has a 
request to send, which direction to send it, and watches the Target signals to wait for a cycle in which the bus is 
free. At the same time, it monitors its own target signal to determine when the bus contents are for it. 

Each segment has only one read request outstanding at a time, so there is no danger of receiving data from both 
memory controllers at once, but it is possible to receive probes simultaneously from both coherence controllers; they 
must be captured and queued. I/O devices may have multiple outstanding reads, and therefore need the ability to 
accept two or more responses simultaneously. 


Figure 7.4: Memory Bus Interface 


7.9.1 Transaction ID 


Every command on the Request/Address bus is accompanied by a transaction id, which identifies the originator 
of the request and uniquely identifies the transaction among the outstanding requests by that originator. 
There are eight originators: the six processors, the DMA engine, and the PCI-Express controller. The DMA and 
PCI/PMI units may each have up to four reads and four writes outstanding. A processor segment may have an IO 
read, an IO write, a cache owned-to-shared transfer (WRSTRANS) and a cache fill/replacement outstanding — all 
simultaneously. 
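

Table 7.1: Memory Bus Port Signals From and To Processor Segment X 


Signal Name | Description 
csw_psX_CmdAddrGnt_c1a | (description not legible in source) 
psX_csw_DataTarget (stage suffix and bit range not legible) | (description not legible in source) 
psX_csw_{E/O}DataReq_c2a | Request for access to data bus 
csw_psX_{E/O}DataGnt_c3a | Grant from switch to PS allowing access 
(signal name not legible) | Writing 8 bytes, 64 bytes, first 32 bytes, last 32 bytes. 
(signal name not legible) | Doubleword 0,2,4,6 of block (multiplexed) 
csw_psX_Command_c1a (bit range not legible) | (description not legible in source) 
csw_psX_Addr_c1a (bit range not legible) | (description not legible in source) 
csw_psX_DataValid (stage suffix not legible) | (description not legible in source) 
csw_psX_HalfMask_c3a[1:0] | Writing 8 bytes, 64 bytes, first 32 bytes, last 32 bytes. 
csw_psX_Data (stage suffix and bit range not legible) | (description not legible in source) 
csw_psX_TIDBusy_c5a[1:0] | A Coherence Widget claims that TID 0 and/or 1 is busy. 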


(All six processor segments have identical signal ports. Replace “psX” in the above with ps0, ps1, and so on. Segments 
can send a command to either the Even side controller or the Odd side controller as designated by the 
{E/O} prefix. So, in fact, segment 0 has two address/command request signals: ps0_csw_ECmdAddrReq_c0a and 
ps0_csw_OCmdAddrReq_c0a.) The PCI interface is identical to the PS interface: replace psX in all signal names 
with pci for this interface. 


Table 7.2: Memory Bus Port Signals From and To DMA or PCI Segment 


Signal Name | Description 
dma_csw_CmdAddrTarget_c0a[7:0] | Command/Address Destination 
dma_csw_{E/O}CmdAddrReq_c0a | Request for access to command/address bus 
csw_dma_CmdAddrGnt_c1a | Grant from switch to PS allowing access 
dma_csw_Way_c0a | (description not legible in source) 
dma_csw_DataTarget_c1a (bit range not legible) | (description not legible in source) 
dma_csw_{E/O}DataReq_c1a | Request for access to data bus 
csw_dma_{E/O}DataGnt_c2a | Grant from switch to PS allowing access 
(signal name not legible) | Writing 8 bytes, 64 bytes, first 32 bytes, last 32 bytes. 
(signal name not legible) | Doubleword 1 of block 
(signal name not legible) | Byte mask for I/O reads and writes 
csw_dma_RdTIDBusy_c5a[3:0] | A Coherence Engine claims that TID[x] is currently in flight 
csw_dma_WtTIDBusy_c5a[3:0] | A Coherence Engine claims that TID[x] is currently in flight 


7.9.2 Target 


Every transfer on the Request/Address bus or the Data bus is directed to a specific destination, which may be 
one of the originating interfaces or one of the two coherence controllers and their associated memory interfaces. 
When driving the bus, each interface selects either Evenbound or Oddbound direction, depending on the relative 
positions of source and destination. When responding to a request, the target is decoded from the originator 
portion of the transaction id. Original requests are always sent to the coherence controller indicated by address bit 
6 (should be programmable). 

In addition to the Target bits, the coherence controllers can assert Cmd_Bcast in conjunction with all the Target 
bits to cause all receivers to accept an invalidate command. 


Target vectors are calculated to have a number of "1" bits set equal to the 
distance between the sending and the receiving node. The "leading" 1 is 
eliminated for the target calculation in all nodes other than the COH. 


targetVectorType bsn2target(fromBSN, toBSN) { 
    // Coherence controllers (COHO, COHE): one bit set per bus stop of distance. 
    if (fromBSN is COHO or fromBSN is COHE) { 
        return shiftLeft(1, abs(fromBSN - toBSN)) - 1; 
    } 
    // All other bus stops drop the leading 1: (distance - 1) bits set. 
    else { 
        return shiftLeft(1, abs(fromBSN - toBSN) - 1) - 1; 
    } 
} 


Table 7.3: Target Addressing 


As shown in Table 7.3, each interface to the Mem Bus generates an 8-bit target mask. The mask determines 
how many downstream interfaces are expected to forward the data. The interface calculates the difference between 
its bus stop number and the destination’s bus stop number. It then sets that number of bits (less 1) at the lsb 
end of the target vector. When the switch grants a bus cycle to an interface, it augments the provided 8-bit target 
with the request line from the interface. This additional bit is driven downstream as the lsb of the complete (9-bit) 
target. This allows the downstream node to determine if there is live data on the bus. 
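
The target calculation and the 9-bit augmentation can be written out as runnable code. This is a sketch, not the RTL: the bus stop numbers used in the example are made up, the `from_is_coh` flag stands in for the "fromBSN is COHO or COHE" test, and `full_target` is a hypothetical name for the step the switch performs on a grant.

```python
def bsn2target(from_bsn, to_bsn, from_is_coh):
    """8-bit target mask for a transfer between two bus stop numbers."""
    distance = abs(from_bsn - to_bsn)
    if distance == 0:
        raise ValueError("source and destination must be different bus stops")
    # Coherence controllers set `distance` bits; every other sender drops
    # the leading 1 and sets `distance - 1` bits.
    ones = distance if from_is_coh else distance - 1
    return (1 << ones) - 1

def full_target(mask8, request):
    """9-bit target: the request line becomes the lsb below the 8-bit mask."""
    return (mask8 << 1) | (1 if request else 0)

mask = bsn2target(2, 5, from_is_coh=False)
print(bin(mask), bin(full_target(mask, True)))
```

An adjacent non-COH sender produces an all-zero 8-bit mask, so the request bit in the lsb is the only indication to the neighbour that the bus carries live data.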


7.9.3 Completion 


When requestors receive fill data from other caches (that is, from any element other than the memory controller), 
they notify the coherence controller by sending the transaction id (and possibly other bits). This allows the 
coherence controller to know when it can release any other request for the same address. Fills from memory can 
notify the coherence controller directly. Such notice should be timed to allow a cache hit and transfer, rather than 
initiating another memory request. 


7.9.4 CSW Bus Arbitration 


The memory bus consists of two sets of separately arbitrated wires (see Table 7.1): 
1. Evenbound request/address/data 
2. Oddbound request/address/data 


Each such set has its own arbitration at each L2 segment; the segment can send if and only if (a) it wants to and 
(b) there is nothing on the wires from upstream. Arbitration controls use of the entire set of even- or odd-bound 
wires. Note that read commands optionally transfer a victim, but do not explicitly send the victim address. The 
coherence controller can determine the victim address from the master tags and way select. 

The coherence controller may prevent any bus stop from winning arbitration in order to prevent overflow of the 
DDR controller request buffers. This merely imposes a delay in time, but may not create deadlock opportunities. 


7.9.4.1 Fairness 


How do we prevent a segment being locked out of bus access by traffic from upstream? First, we should note 
that we can do all kinds of calculations that show that we’ll never really tax the capacity of the CSW or DDR 
controllers. And then we’d find a chip that hung because we taxed the capacity of the CSW or DDR controllers. 
So, the arbitration protocol prevents complete lockout by rationing access to the CSW when there is contention. 

If a bus stop (say the DMA engine) initiates a request in cycle 0, it will find out in cycle 1 if it won the bidding. 
Assume that it wins. It may have triumphed over some other downstream bus stop X. (Note that in the even-bound 
direction, almost everybody is downstream of the DMA engine.) In this case, the DMA engine will not win further 
arbitration for the bus until EVERY downstream bus stop that lost to the DMA bid is eventually granted access 
to the CSW chain. This is implemented completely within the CSW arbitration logic. 
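
The rationing rule can be modelled in software as a per-winner list of downstream stops still owed a turn. This is a sketch of the rule as stated above, not the CSW arbitration logic itself; the class name, method names, and stop names are invented for illustration.

```python
class FairnessTracker:
    """Toy model: a winner is barred until every stop it beat has been granted."""

    def __init__(self):
        # winner -> set of downstream losers that have not yet been granted
        self.owed = {}

    def may_bid(self, stop):
        """A stop may win again only when it owes nobody a turn."""
        return not self.owed.get(stop)

    def record_grant(self, winner, downstream_losers):
        # A grant to `winner` settles whatever other stops owed it.
        for waiting_on in self.owed.values():
            waiting_on.discard(winner)
        # The winner now owes a turn to every stop it just beat.
        self.owed.setdefault(winner, set()).update(downstream_losers)

tracker = FairnessTracker()
tracker.record_grant("dma", {"ps0", "ps1"})
print(tracker.may_bid("dma"))     # barred while ps0 and ps1 wait
tracker.record_grant("ps0", set())
tracker.record_grant("ps1", set())
print(tracker.may_bid("dma"))     # eligible again
```

The model shows why complete lockout cannot occur: each win by an upstream stop creates an obligation that only grants to the losing stops can clear.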


7.9.4.2 Worst Case Traffic Analysis 


Every request to the memory arrays requires one 4ns cycle of the Memory Bus, and eight edges of the memory’s 
DQ bus. We’re designing for DQ bus clock rates up to 400 MHz, so 8 edges take 10ns; thus the memory bus cannot 
be more than 40% saturated by main memory traffic. In addition, inter-cache transfers can occur concurrently with 
main memory access. Each such access encounters a minimum latency of 12 4ns cycles, so the maximum possible 
bus loading is 6 requestors/12 cycle latency = 50%. The worst loading at any point on the bus is less than this 
because the requests have to be distributed among many L2 segments to be requested and serviced that quickly, 
with the result that the bus isn’t occupied for its full length, and the interface in question will be able to share at 
least some of the used cycles. 
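
The two bounds quoted above can be checked with a short calculation. It assumes double-data-rate signalling, that is, two DQ edges per clock period, which is what makes eight edges at 400 MHz take 10 ns; the variable names are illustrative.

```python
# Memory-traffic bound: bus occupancy per request vs. DQ time per request.
MEM_BUS_CYCLE_NS = 4.0
DQ_CLOCK_MHZ = 400.0
EDGES_PER_REQUEST = 8
EDGES_PER_CLOCK = 2                       # assumption: DDR signalling

dq_period_ns = 1000.0 / DQ_CLOCK_MHZ                              # 2.5 ns
dq_time_ns = (EDGES_PER_REQUEST / EDGES_PER_CLOCK) * dq_period_ns
memory_share = MEM_BUS_CYCLE_NS / dq_time_ns

# Inter-cache bound: each requestor has one request per minimum latency.
REQUESTORS = 6
MIN_LATENCY_CYCLES = 12
intercache_share = REQUESTORS / MIN_LATENCY_CYCLES

print(dq_time_ns, memory_share, intercache_share)
```

The results are 10 ns per request on the DQ bus, a 40% ceiling from main memory traffic, and a 50% ceiling from inter-cache transfers.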

We also have to account for the DMA engine and PCI-Express controller, each of which can have four requests 
outstanding at any time, but only two of them can be to the same memory controller, and very few of which result 
in inter-cache transfers. 


7.9.5 CSW Queuing of Commands and Data 


At each CSW bus stop, one module can inject commands or data onto the Even or Odd memory bus, and the 
CSW can deliver commands or data to the module. Incoming commands may arrive two per cycle (one from each 
direction), but the bus stop interface can only transmit one of those commands into the module per cycle. The 
CSW contains queues in each bus stop to handle cases where commands arrive too fast. Data can also arrive from 
both directions at once, if a module ever requests multiple data transfers at a time. The processor segments limit 
themselves to one data request at a time, so no queuing is required in their bus stops, but the DMA and PCI can 
make multiple outstanding data requests, so their bus stops require data queues. The depth requirements for each 
queue are analyzed below, for each type of bus stop. 

To know how deep the command and data queues should be, we must identify a worst case number of commands 
that could arrive at this bus stop, and consider how quickly the module can consume the transactions as they are 
coming in. A bus stop could receive one command per TID in the system: 12 processor TIDs, 8 DMA TIDs, and 8 
PCI TIDs. [NOTE: this analysis assumes that INTs consume a TID, and a block will not send another INT until 
a DONE response comes back.] In the worst case, these 28 commands could arrive in 14 consecutive cycles, half 
coming from the even side and half coming from the odd side. Half of them can be consumed by the module, while 
the other half must be queued. So the command queue for each bus stop must be 14 commands deep. For the data 
queues, the answer depends on the number of outstanding data transactions that the module can produce. The 
processor segment is careful to only allow one data transaction at a time, while DMA and PCI can have 4 reads 
outstanding plus some number of WTIOs. 
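
The 14-deep command queue can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the design: the function name and its parameters are hypothetical, and it assumes the worst case described above, in which two commands arrive per cycle (one from each direction) while the module consumes one per cycle.

```python
def worst_case_command_queue(processor_tids=12, dma_tids=8, pci_tids=8,
                             arrivals_per_cycle=2, consumed_per_cycle=1):
    """Return (cycles_of_arrival, peak_queue_occupancy) for one bus stop."""
    pending = processor_tids + dma_tids + pci_tids  # one command per TID
    queued = 0
    peak = 0
    cycles = 0
    while pending > 0:
        arriving = min(arrivals_per_cycle, pending)
        pending -= arriving
        queued += arriving
        # The module takes at most consumed_per_cycle commands this cycle.
        queued -= min(consumed_per_cycle, queued)
        peak = max(peak, queued)
        cycles += 1
    return cycles, peak


cycles, depth = worst_case_command_queue()
print(cycles, depth)
```

With the default TID counts the burst lasts 14 cycles and the queue peaks at 14 entries, which matches the depth stated above.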

Table 7.4 summarizes these results.

The different requirements for bus stops lead to the need for several bus stop variants. The command sides of all
bus stops are copies of the same module (CswPca), whose queue structure is described in Figure 7.5. Commands
from even and odd sides are queued if necessary, and the bus stop delivers one command at a time to the target
module. The processor needs no data queue. The PCI bus stop queues data from even and odd sides, and delivers
it to the PCI at a rate of two doublewords per cycle (Figure 7.6). The DMA bus stop queues data from even and
odd sides, and delivers it to the DMA at a rate of eight doublewords per cycle (Figure 7.7).


Cmd/Data | Max Arriving, Worst Case | Number Consumed | Queue Depth Needed
Command | 12 processor TIDs (probes) + 8 DMA TIDs + 8 PCI TIDs. Total: 28 commands in 14 cycles | 14 in 14 cycles | 14
Processor Data | 1 read response or 1 WTIO data word, but never both at once | 1 every 4 cycles | none
PCI Data | 4 read responses + 2 WTIO data words. Total: 6 transfers in 3 cycles. (The PCI bus stop supports two modules which can each do WTIOs.) | 1 every 4 cycles |
DMA Data | 4 read responses + 1 WTIO data word. Total: 5 transfers in 3 cycles | 3 in 3 cycles |

Table 7.4: Queue Depth Requirements for CSW Bus Stops


[Block diagram: commands (Command, Addr, CmdAddrTID, CmdOrigin, etc.) arriving from the Odd direction and from
the Even direction are queued and delivered one at a time, with CmdAddrValid, to the PS, PCI, or DMA.]

Figure 7.5: CSW Queues for CmdAddr Requests


7.9.6 Transfer order


Data transfers on the CSW are ordered to ensure a fixed pipeline timing for each section of the bus, while
delivering cache miss data to processors starting with the requested word first, and keeping aligned 16-byte units
together. Note that address bit 3 is ignored, and setting bits 4 and/or 5 results in exchanging the order of halves of
the block.


Table 7.5: Transfer sequence as a function of address 


Data0, Data1 | Data2, Data3 | Data4, Data5 | Data6, Data7
07:00, 0F:08 | 17:10, 1F:18 | 27:20, 2F:28 | 37:30, 3F:38
07:00, 0F:08 | 17:10, 1F:18 | 27:20, 2F:28 | 37:30, 3F:38
17:10, 1F:18 | 07:00, 0F:08 | 37:30, 3F:38 | 27:20, 2F:28
17:10, 1F:18 | 07:00, 0F:08 | 37:30, 3F:38 | 27:20, 2F:28
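
The rows of Table 7.5 are consistent with an exclusive-OR style of sub-block ordering. The sketch below illustrates that reading and is not a statement of the hardware's logic: `transfer_order` is a hypothetical helper, and the behavior for address bit 5 (swapping the two 32-byte halves) is inferred from the sentence above rather than from a table row.

```python
def transfer_order(addr: int) -> list[int]:
    """Byte offset of the 16-byte unit carried on each of the four beats.

    Beat 0 is carried on Data0/Data1, beat 1 on Data2/Data3, and so on.
    Address bit 3 is ignored; bits 5:4 select the first unit (assumption:
    the remaining units follow in XOR order).
    """
    first_unit = (addr >> 4) & 0x3
    return [(beat ^ first_unit) * 16 for beat in range(4)]


for a in (0x00, 0x08, 0x10, 0x18):
    print(hex(a), [hex(off) for off in transfer_order(a)])
```

Under this assumption, addresses 0x00 and 0x08 give the order 0x00, 0x10, 0x20, 0x30, and addresses 0x10 and 0x18 give 0x10, 0x00, 0x30, 0x20, as in the table.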


7.10 Detailed Interface and Block Descriptions 
7.10.1 The Normal Flow Of Events, Hazards, and General Ordering Cases 


Almost all the mischief that can happen in a cache/memory system surrounds the handling and ordering of 
reads. Writes almost take care of themselves. So, I’ll attempt to explain the operation of the coherence widget by 
looking at the way read operations interact with other read operations and write operations and the distributed L2 
cache. 

Note that we’re talking about a system with a split bus — that is, a read transaction is split into a read 
request for address A from processor X (which we note as Read(X,A)), and a data response which we'll write as 
ReadData(X,A,D) if we ever need to. Similarly, we break write operations into Write(X,A) and WriteData(X,A,D) 
since the data may be delivered many cycles after the corresponding address. 

The tables below, one for each kind of transaction, describe the sequence of events to carry out the transaction. 
When a unit transmits a command into the cache switch, we denote the operation as CMD(C,U,T,A,W,L,O) where 


C is the command being transmitted. 


U is the target unit to which this command is being sent. It is one of P0, P1, P2, P3, P4, P5, PCI, DMA, COHE,
or COHO.


T is the transaction ID. Tx designates a transaction ID that contains the unit ID for unit X in its upper bits.
A is the relevant address, or the value to be driven onto the address bus.


W is the L2 cache way that will hold the returned data. W is not always relevant to a command; in those cases
it is omitted.


L indicates that the block in question had an outstanding load/link operation registered on it by the sending
processor. L is not always relevant to a command, in which case it will be omitted.


O indicates an “originator” field. This is almost always optional. When used it will be represented as
ORIGIN=value.


The data portion of the transaction will be represented as DATA(U,T,D,s) where U and T are as described above,
and


D is the data to be transferred, either 8 bytes, 32 bytes, or 64 bytes.


s is the size and placement of the transfer. It indicates that the block is one of: 8 bytes long; 32 bytes long, starting
with doublewords 0 and 1; 32 bytes long, starting with doublewords 4 and 5; or 64 bytes long.


The text below refers to the “command bus” or the “data bus.” We don’t really have “buses” in the chip. Instead
we have pipelined-multitap-multiplexed-daisychains, but “bus” is a little easier on the eyes. For purposes of
understanding the flow of the transactions, “bus” is a reasonable approximation of what we’re implementing. For more
detail, see 7.15.2.
7.10.2 Transaction Steps and the CSW Buses


The two bus events described above, CMD() and DATA(), require signals to be sequenced over several cycles
or pipeline stages on the CSW ports. For example, CMD(RDEX,COHE,0x6,0x2badbeef0,1), meaning “Read
and acquire Exclusive Ownership from the Even Coherence widget, block 0x2badbeef0. Register the new owner
(processor 3) as caching this block in way 1,” appears on the processor port to the bus as shown in Figure 7.8. The
sequencing for the event DATA(D[7:0], Px, TID, 64) is shown in Figure 7.9. Half block transfers may be to either
the first 32 bytes of a 64 byte block, or the second. These two transfers are shown (from the processor’s view) in
Figures 7.10 and 7.11. Finally, 8 byte transfers (used for I/O operations) are described in Figure 7.12.
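
The CMD() notation can be mirrored as a small record type, which is convenient when writing traces or test stimulus. This is only a sketch: the class name `Cmd`, its field names, and the `UNITS` tuple are hypothetical, and the printed form simply follows the notation defined in Section 7.10.1, omitting W, L, and the originator when they are not relevant.

```python
from dataclasses import dataclass
from typing import Optional

# Target units named in the text (hypothetical constant name).
UNITS = ("P0", "P1", "P2", "P3", "P4", "P5", "PCI", "DMA", "COHE", "COHO")


@dataclass(frozen=True)
class Cmd:
    command: str                   # C: the command being transmitted
    target: str                    # U: the target unit
    tid: int                       # T: the transaction ID
    addr: int                      # A: the address, or value driven on the address bus
    way: Optional[int] = None      # W: L2 way, omitted when not relevant
    linked: Optional[bool] = None  # L: load/link registered, omitted when not relevant
    origin: Optional[str] = None   # O: originator, almost always optional

    def __post_init__(self):
        if self.target not in UNITS:
            raise ValueError(f"unknown target unit: {self.target}")

    def __str__(self) -> str:
        fields = [self.command, self.target, hex(self.tid), hex(self.addr)]
        if self.way is not None:
            fields.append(str(self.way))
        if self.linked:
            fields.append("L")
        if self.origin is not None:
            fields.append(f"ORIGIN={self.origin}")
        return "CMD(" + ",".join(fields) + ")"


example = Cmd("RDEX", "COHE", 0x6, 0x2BADBEEF0, way=1)
print(example)
```

Printing `example` reproduces the command used in the paragraph above, CMD(RDEX,COHE,0x6,0x2badbeef0,1).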

The DMA engine interface to the CSW is different from the other interfaces because it has eight 72-bit buses 
in each direction instead of two. Figure 7.13 shows how the data is staged onto Data0-1 in one cycle, then Data2-3 
in the next, and so on. The DMA can send and receive back-to-back transactions on the CSW. 

[Timing diagram, 0 to 20 ns: cclk; psX_csw_ECmdAddrReq_c0a; psX_csw_ECmdAddrTarget_c0a<8:0>;
csw_psX_ECmdAddrGnt_c1a; psX_csw_CmdAddrTID_c0a<5:0>; psX_csw_Command_c0a<1:0>; psX_csw_Addr_c0a<35:3>;
psX_csw_BMask_c0a<7:0>.]

Figure 7.8: Signalling Sequence for CMD(RDEX, COHE, 0x6, 0x2badbeef0, 1)
[Timing diagram, 0 to 20 ns: cclk; psX_csw_DataTarget_c2a<8:0>; csw_psX_EDataGnt_c3a; psX_csw_DataTID_c2a<5:0>;
psX_csw_HalfMask_c2a<1:0>; psX_csw_DataOrig_c2a<2:0>; psX_csw_Data0_c2a<71:0> carrying DAT[0], DAT[2], DAT[4],
DAT[6]; psX_csw_Data1_c2a<71:0> carrying DAT[1], DAT[3], DAT[5], DAT[7]; csw_cohx_Data0 and Data1 at c3a,
Data2 and Data3 at c4a, Data4 and Data5 at c5a, Data6 and Data7 at c6a.]

Figure 7.9: Signalling Sequence for DATA(DAT[7:0], Px, TID, 64)
[Timing diagram, 0 to 20 ns: cclk; psX_csw_EDataReq_c2a; psX_csw_DataTarget_c2a<8:0>; csw_psX_EDataGnt_c3a;
psX_csw_DataTID_c2a<5:0>; psX_csw_HalfMask_c2a<1:0>; psX_csw_DataOrig_c2a<2:0>; psX_csw_Data0_c2a<71:0>;
psX_csw_Data1_c2a<71:0>.]

Figure 7.10: Signalling Sequence for DATA(DAT[3:0], Px, TID, 32F)


[Timing diagram, 0 to 20 ns, same signals as Figure 7.10.]

Figure 7.11: Signalling Sequence for DATA(DAT[3:0], Px, TID, 32S)


[Timing diagram, 0 to 10 ns, same signals as Figure 7.10.]

Figure 7.12: Signalling Sequence for DATA(D, Px, TID, 8)


May 14, 2014 361 Rev 51328 


SiCortex Confidential CHAPTER 7. L2 CACHE COHERENCE AND SWITCH 


[Timing diagram, 0 to 40 ns: requests and grants (GntA, GntB, GntC, GntD) for transactions A through D, with
data D0 through D7.]

Figure 7.13: Signalling Sequence for DATA(DAT[7:0], DMA, TID, 64) from the DMA Engine
Transaction A is granted in the following cycle (the fastest possible grant). Transaction B is granted after one 
stall cycle. Transactions C and D are requested and granted back-to-back. Note that the CSW samples all values 
relative to the request cycle, not the grant, and the CSW stores the content of the request until the request is 
granted. The DMA is not required to hold the data during stalls. (Other CSW bus stops have other rules.) 
7.10.3 The Outstanding Read CAM and the Write Back CAM


The ICE9 L2 cache system supports six processors, a DMA engine, and a PCI express widget, and can field
up to 28 transactions at any one time. Each of the six processors can have one read and one write transaction
outstanding at a time. The DMA engine and PCI widget can each have four reads and four writes outstanding at a time.

We want to maintain memory ordering to at least an intuitive degree. That is, processors never see “time going 
backwards.” I could go on for a bunch of pages about strong consistency vs. weak consistency. Suffice it to say, 
we want memory ordering semantics that are the same as we implemented with MIPS multiprocessors. Whatever 
that is. 

We make sure that the L2 system doesn’t re-order reads and writes to the same “block” (32 bytes) relative to 
each other by chaining operations to the same block together in the outstanding read CAM and the write back 
CAM. We cover this in a fair amount of detail in the sections on read and write ordering hazards, below. In the 
transaction flows below we identify several operations on the ORC and WBC. 


ORC_Reg(X,A,T) store A — the block address, and T — the transaction ID as the keys in the ORC. Store X in 
the requester field. ORC_Reg also remembers whether the address for this entry had matched against any 
other ORC when it was first looked up. (This is used in the EXCLUSIVE to SHARED transition.) Such 
entries have their “HEAD_OF_LIST” bit set. All others have this bit cleared. 


ORC_Check(A) Look up A in the ORC. Match only against ORC entries whose Xd, Ad, Td, Op fields are empty
(i.e. those that have no dependents).


ORC_CheckS(Tx) Lookup transaction ID Tx in the ORC. This is used by the WRSTRANS operation. 


ORC_Dep(Ty,Xd,Ad,Td,Op) find the entry matching transaction ID Ty and store a dependent operation from 
node Xd, using block address offset Ad, transaction ID Td, and Op. 


ORC_Rel(T) find the entry matching TID T. If the Xd field is not null, then there was a dependent read or block
write operation queued up behind this read. Launch the dependent operation. Clear the valid bits for the
matching CAM entry. (Release the entry.)


WBC_Reg(X,A,T) store A — the block address, and T — the transaction ID as the keys in the WBC. Store X in 
the requester field. 


WBC_Check(A) Look up A in the WBC. Match only against WBC entries whose Xd, Ad, Td, Op fields are empty
(i.e. those that have no dependents).


WBC_GetAddr(T) Look up T in the WBC. Return the value for A at the matching location. (This is how we
retrieve the write address that goes along with a block of data. Note that data is sent several cycles after the
write address arrives.)


WBC_Dep(Ty,Xd,Ad,Td,Op) find the entry matching transaction ID Ty and store a dependent operation from 
node Xd, using block address offset Ad, transaction ID Td, and Op. 


WBC_Rel(T) find the entry matching TID T. If the Xd field is not null, then there was a dependent read or
block write operation queued up behind this write. Launch the dependent operation. Clear the valid bits for
the matching CAM entry. (Release the entry.) Note that WBC_Rel is triggered on completion of a write with
respect to the DDR controller or, in the case of forwarded writes, completion signalled by a BWTDONE.
See Section 


We also perform a few operations on the L2 master tags. 


TAG_Check(A) Lookup A in the L2 master tag arrays (one for each of the 6 processor/cache segments). Return 
the state and a list of matching entries. 


TAG_Update(P,A,W,S) Create an L2 master tag entry in the tag array for processor P, in way W, with address 
A. Set the state to S. State is one of EX (exclusive), SH (shared), or IN (invalid). 


TAG_Victim(A,W) Return the address of the victim block for the L2 tag array at the index derived from A for 
way W. (For operations that include an implicit victim write, we need the address of the victim block.) 
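
The chaining behavior of the Reg, Check, Dep, and Rel operations can be illustrated with a small software model. This is a sketch under simplifying assumptions, not the CAM implementation: the class and field names are hypothetical, entries are kept in a dictionary keyed by TID, and the Check operation returns the TID of the matching entry so the caller can pass it to Dep.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dependent:
    requester: str  # Xd
    addr: int       # Ad
    tid: int        # Td
    op: str         # Op


@dataclass
class Entry:
    requester: str
    addr: int
    dependent: Optional[Dependent] = None


class OutstandingCam:
    """Toy model of the ORC (the WBC follows the same pattern)."""

    def __init__(self):
        self.entries = {}

    def reg(self, x: str, a: int, t: int) -> None:
        self.entries[t] = Entry(x, a)

    def check(self, a: int) -> Optional[int]:
        # Match only entries that have no dependent, i.e. the tail of a chain.
        for tid, entry in self.entries.items():
            if entry.addr == a and entry.dependent is None:
                return tid
        return None

    def dep(self, ty: int, xd: str, ad: int, td: int, op: str) -> None:
        self.entries[ty].dependent = Dependent(xd, ad, td, op)

    def rel(self, t: int) -> Optional[Dependent]:
        # Release the entry and hand back the dependent operation, if any.
        return self.entries.pop(t).dependent


orc = OutstandingCam()
orc.reg("PY", 0x1000, 8)            # PY's read of block 0x1000 is outstanding
tv = orc.check(0x1000)              # PX's later read hits it
orc.reg("PX", 0x1000, 2)
orc.dep(tv, "PX", 0x1000, 2, "RDEX")
launched = orc.rel(8)               # PY's read completes; PX's read is launched
print(tv, launched.tid)
```

Because Check matches only entries without dependents, a third request to the same block would chain behind PX's entry rather than PY's, which keeps operations to one block in order.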
7.10.3.1 The ORC


The ORC is indexed as a CAM with the Address of interest as a key. It can also be directly indexed by TID.
Each entry in the table contains eleven fields:


Valid True if this entry represents a currently outstanding memory read transaction 

AddrTag The block address of the corresponding transaction 

Last True if this is the last memory read or write operation posted for AddrTag 

Excl True if the block was in the EXCLUSIVE state when the operation was first registered in the ORC 
Shr True if the block was in the SHARED state when the operation was first registered in the ORC 


Own This is the processor identifier of the current owner of the block if the block was SHARED and some processor
segment claims that it is willing to source the data. (The owner, if it exists, is the last to acquire the block.
It is possible, however, that the last acquirer has evicted the block. In this case, the OWN field points to a
nonexistent processor segment (0xf).)


DepTID The TID of an operation that was registered as a dependent on this entry. Valid only if Last is false. 
DepCmd The command for the dependent operation. Valid only if Last is false. 

DepAddr The low bits of the address of the dependent operation. 

DepOrg The originator of the dependent operation. 


SrcCmd The command that created this entry in the ORC. 


7.10.3.2 The WBC 
The WBC is indexed as a CAM with the Address of interest as a key. It can also be directly indexed by TID. 
Valid True if this entry represents a currently outstanding memory write transaction
AddrTag The block address of the corresponding transaction 
Last True if this is the last memory read or write operation posted for AddrTag 
Winv True if this entry corresponds to a writeback (WINV) or victimization (RDV, RDSV) 
Shr True if this entry was in the SHARED state when it was created. 
LowBits The low bits of the address for the dependent operation. Valid only if Last is false. 
DepTID The TID of an operation that was registered as a dependent on this entry. Valid only if Last is false. 
DepCmd The command for the dependent operation. Valid only if Last is false. 
DepOrg The originator of the dependent operation. 


DepOwn The owner of the block when the dependent command was registered on this entry. 


7.10.4 Transaction Flows 
7.10.4.1 D-Stream Read to a Non Resident Block 


This is the simplest case, so we start with that. Assume that processor X launches a load operation that misses 
on block A. The operation may displace a victim block. If it does not, the operation proceeds as a simple read, 
shown in Table 7.6. If a victim write back is required, the operation is described in Table 7.7. 

Note that during the command processing phase of the transaction (cycles 2 and 3) the address is first looked
up in the master tags, the writeback CAM (WBC) and the outstanding read CAM (ORC). In the second of the
two cycles, we update the tags, the WBC, and the ORC. For the CAMs, the update to the CAM array occurs
at the start of the cycle, so comparisons to the new CAM entries can occur immediately. The tag arrays, however,
are implemented as RAMs and so we must implement a comparison bypass to allow two back to back operations
on the same block address to work properly.


Cycle | COH Action | Notes
2 | TAG_Check(A) - no hit found. WBC_Check(A) - no hit found. ORC_Check(A) - no hit found. Send A to DDR Controller and queue for DDR Read operation. |
3 | ORC_Reg(PX, A, Tx). TAG_Update(PX, A, W, EX). |
N | DATA(X,Tx,D) - return Data to Px. | Data is returned in “best word first” order.
 | ORC_Rel(Tx) | Px can now launch a new read operation as soon as the first data word arrives.

Table 7.6: D-Stream Read to a Non Resident Block: No Victim Writeback

Note that the difference between an RDEX and RDV is entirely found in the writeback operation starting with 
the WBC_Reg in cycle 3, and including the data write cycles beginning in cycle M. Writebacks never stall. That 
is, the result of tag lookups in the L2 tags, WBC, or ORC has no effect on the writeback or its time of arrival. For 
this reason, we will show only a few examples of the writeback version of the transaction flows. (For a discussion of 
the only really interesting thing that can happen to a writeback, see the description of victim writeback collisions 
against BWT operations in Section 7.10.4.19.) 
Cycle | PX Action | COH Action | Notes
1 | CMD(RDV,COHn,Tx,A,W) | |
2 | | TAG_Check(A) - no hit found. WBC_Check(A) - no hit found. ORC_Check(A) - no hit found. Send A to DDR Controller and queue for DDR Read operation. | If a WBC hit is found here, it must be against a BWT.
3 | | Av = TAG_Victim(A,W). ORC_Reg(PX, A, Tx). WBC_Reg(PX, Av, Tx). TAG_Update(PX, A, W, EX). | We remember that there is a write outstanding since the data may not arrive for some time. The WBC allows us to buffer the write address to send along with the data, and to protect against “ships passing in the night.” See Section 7.10.4.17. If Av matches an outstanding BWT, then we write Av = NULL in the WBC.
M | DATA(COH,Tx,Dw) or CMD(WBCANCEL,Tx) | Data or WBCANCEL arrives at COH. Aw = WBC_GetAddr(Tx). Send Aw along with the data Dw to the DDR controller. WBC_Rel(Tx). | Cycle M may be coincident with cycle 3. If Aw is NULL, then this writeback was killed by an intervening BWT request.
N | | DATA(X,Tx,Dr) | Data is returned in “best word first” order.
 | | ORC_Rel(Tx) | Px can now launch a new read operation as soon as the first data word arrives.
Cycle | Action | Notes
2 | TAG_Check(A) - no hit found. WBC_Check(A) - no hit found. Tv = ORC_Check(A) - HIT! Send A to DDR Controller and queue for DDR Read operation. | If the TAGS are all clear, then the read that we’re depending on is a BRD to an uncached location from the DMA engine or PCI.
3 | Shootdown A in DDR controller. ORC_Reg(PX, A, Tx). TAG_Update(PX, A, W, EX). ORC_Dep(Tv, PX, A, Tx). | Register our dependence on the earlier read operation.
 | DATA(DEV,Tv,Dr) | Data is returned by the DDR controller. It is not possible for this sequence to end with a forwarded read acknowledged by DMA/PCI (PRBDONE).
M+2 | Send address A to DDR controller. |
 | Continue at step N in Table 7.6 |
Cycle | Action | Notes
2 | TAG_Check(A) - no hit found. Tv = WBC_Check(A) - HIT! ORC_Check(A) - no hit. Send A to DDR Controller and queue for DDR Read operation. | If the TAGS are all clear, then the write that we’re depending on is either a victim writeback or a BWT to an uncached location.
3 | Shootdown A in DDR controller. ORC_Reg(PX, A, Tx). TAG_Update(PX, A, W, EX). WBC_Dep(Tv, PX, A, Tx). | Register our dependence on the earlier write operation.
 | DATA(X,Tv,Dr) | Data is returned to the DDR controller.
 | Send address A to DDR controller. |
 | Continue at step N in Table 7.6 |




SiCortex Confidential 7.10. DETAILED INTERFACE AND BLOCK DESCRIPTIONS 


7.10.4.2 D-stream Read to a Cached Block 


This is where things get interesting. Consider again the case of a processor X reading block A. In this case, 
we assume that block A is already resident in some other cache — processor Y for example. Our cache coherence 
scheme allows a block to be in one of three states: INVALID, EXCLUSIVE, or SHARED. (The SHARED state is 
implemented for i-stream cache blocks only. This section will describe accesses to a block that is in the EXCLUSIVE 
state or the INVALID state. For D-stream accesses to blocks in the SHARED state, see Section 7.10.4.5.) In the 
first case, described in Table 7.10, processor X does not require a victim write back (block A is replacing an 
INVALID, SHARED, or EXCLUSIVE-CLEAN block). In the second case, described in Table 7.11, processor X 
must write back a victim block. 
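
The way the three lookups steer a D-stream read can be summarized as a small decision function. This is a sketch of the flows in the tables of Sections 7.10.4.1 and 7.10.4.2, not the coherence widget's logic: `plan_rdex` and its arguments are hypothetical, the steps are plain strings, and the treatment of a block that is both cached and hit in a CAM simply reuses the uncached rule, which is an assumption.

```python
from typing import Optional


def plan_rdex(tag_owner: Optional[str],
              wbc_hit_tid: Optional[int],
              orc_hit_tid: Optional[int]) -> list[str]:
    """List the COH steps for a read of block A by PX, given the lookup results."""
    steps = ["Send A to DDR controller and queue read",
             "ORC_Reg(PX,A,Tx)",
             "TAG_Update(PX,A,W,EX)"]
    if tag_owner is not None:
        # The previous exclusive owner loses the block.
        steps.append(f"TAG_Update({tag_owner},A,W,IN)")
    if orc_hit_tid is not None:
        # Queue behind the earlier read and cancel the speculative DDR read.
        steps += ["Shoot down A in DDR controller",
                  f"ORC_Dep({orc_hit_tid},PX,A,Tx)"]
    elif wbc_hit_tid is not None:
        # Queue behind the earlier write.
        steps += ["Shoot down A in DDR controller",
                  f"WBC_Dep({wbc_hit_tid},PX,A,Tx)"]
    elif tag_owner is not None:
        # No ordering hazard: probe the owner for the data.
        steps += [f"CMD(PRBWIN,{tag_owner},Tx,A)",
                  "Shoot down A in DDR controller"]
    return steps


for step in plan_rdex("PY", None, None):
    print(step)
```

In every case the DDR read is started speculatively in the lookup cycle and shot down afterwards if the data will come from somewhere else.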




Cycle | PX Action | COH Action | PY Action
1 | CMD(RDEX, COHn, Tx, A, W) | |
2 | | TAG_Check(A) - return PY, EX. WBC_Check(A) - no hit found. ORC_Check(A) - no hit found. Send A to DDR controller and queue for DDR read operation. |
3 | | CMD(PRBWIN,PY,Tx,A). TAG_Update(PX, A, W, EX). TAG_Update(PY, A, W, IN). ORC_Reg(PX, A, Tx). Send “shootdown” signal to DDR to cancel DDR read of A. |
L | | | Look up A in L2 tags. Find a hit. Send A to L1 for probe/writeback.
 | | | Copy data from dirty 32 byte blocks from L1 into 64 byte L2 block (update if the L1 entry was
 | | | DATA(PX,Tx,D,d) - return data to PX. d is true if block A was EXCLUSIVE-DIRTY.
 | Receive data from bus, write to L2/L1. Set to EXCLUSIVE-DIRTY if d was true. Set to EXCLUSIVE-CLEAN otherwise. New read operation can be launched as soon as the first 128 bits of data arrives. | |
 | CMD(PRBDONE, COH, Tx, addr=0) | ORC_Rel(Tx) |

Table 7.10: D-Stream Read of Cached Block: No Victim Writeback
Cycle | PX Action | COH Action | PY Action | Notes
2 | | TAG_Check(A) - return PY, EX. WBC_Check(A) - no hit found. ORC_Check(A) - no hit found. Av = TAG_Victim(A, W). Send A to DDR controller and queue for DDR read operation. | | W is the way that we’ll displace and the target way for A.
3 | | CMD(PRBWIN,PY,Tx,A). TAG_Update(PX, A, W, EX). TAG_Update(PY, A, W, IN). ORC_Reg(PX, A, Tx). WBC_Reg(PX, Av, Tx). Send “shootdown” signal to DDR to cancel DDR read of A. | |
L | | | Look up A in L2 tags. Find a hit. Send A to L1 for probe/writeback. | L may be as early as cycle 3, but there may be queueing delay at PY’s command input.
M | DATA(COHn,Tx,Dw) - write back victim block, or CMD(WBCANCEL,Tx) | Aw = WBC_GetAddr(Tx). Send Aw along with the data Dw to the DDR controller. WBC_Rel(Tx). | | Cycle M may occur as early as cycle 3. This activity may run in parallel with other parts of this transaction.
 | | | Copy data from dirty 32 byte blocks from L1 into 64 byte L2 block (update if the L1 entry was |
 | | | DATA(PX,Tx,D,d) - return data to PX. d is true if block A was |
 | Receive data from bus, write to L2/L1. Set to EXCLUSIVE-DIRTY if d was true. Set to EXCLUSIVE-CLEAN otherwise. New read operation can be launched as soon as the first 128 bits of data arrives. | | |
 | CMD(PRBDONE,COH,Tx,addr=) | ORC_Rel(Tx) | |
At times, the PRBWIN arriving at PY will result in PY finding that the data is no longer in its cache. (PY
can autonomously evict an EXCLUSIVE block that is clean, without informing the COH. This can also happen
because of a race between a victimization by PY and a read by PX. See Section 7.10.4.17.) In this case, PY,
upon receiving the PRBWIN command, will send a PRBNOHIT command to PX with the original TID. PX will then
requeue the Read operation as a REREAD(X,A) and the transaction will proceed as shown in Table 7.12. The
table picks up the transaction at cycle L.


Action | Notes
CMD(PRBNOHIT, PX, Tx, addr=0) |
CMD(RDEXR, COHn, Tx, A) | Read Exclusive Retry command. K could be as early as
 | R may be many cycles after L+3.
Receive data from bus, write to L2/L1. Set to EXCLUSIVE-CLEAN. |


Table 7.12: Forwarded D-Stream Read Misses in Probed Cache 
Cycle | PX Action | COH Action | DEV Action | Notes
1 | CMD(RDEX, COHn, Tx, A, W) | | |
2 | | TAG_Check(A) - return PY, EX. WBC_Check(A) - no hit found. Tv = ORC_Check(A) - HIT! Send A to DDR controller and queue for DDR read operation. | | ORC hit is either on PY doing the initial read that fills this block in PY or on a BRD to PY. (Otherwise, the state wouldn’t be PY EXCLUSIVE.)
3 | | TAG_Update(PX, A, W, EX). TAG_Update(PY, A, W, IN). ORC_Reg(PX, A, Tx). ORC_Dep(Tv, Px, A, Tx). Send “shootdown” signal to DDR to cancel DDR read of A. | | Register this transaction as dependent on an earlier read transaction with TID = Tv.
 | | DATA(DEV, Tv, D) | CMD(PRBDONE, COHn, Tv, addr=0) | Either read data is supplied by DDR to PY or PY completed via an inter-cache transfer.
+3 | Continue with step L in Table 7.10 | | |
Cycle | PX Action | COH Action | DEV Action | Notes
1 | CMD(RDEX, COHn, Tx, A, W) | | |
2 | | TAG_Check(A) - return PY, EX. ORC_Check(A) - no hit found. Tv = WBC_Check(A) - HIT! Send A to DDR controller and queue for DDR read operation. | | WBC hit against a block write operation to processor PY.
3 | | TAG_Update(PX, A, W, EX). TAG_Update(PY, A, W, IN). ORC_Reg(PX, A, Tx). ORC_Dep(Tv, Px, A, Tx). Send “shootdown” signal to DDR to cancel DDR read of A. | | Register this transaction as dependent on an earlier read transaction with TID = Tv.
 | | | CMD(PRBDONE, COHn, Tv, addr=0) |
+3 | Continue with step L in Table 7.10 | | |
7.10.4.3 I-stream Read to a Non Resident Block


ICE9 supports cache coherency via an exclusive writer model. That is, the cache does not support a “shared- 
update” operation where one processor is able to write a few bytes through and update cache blocks in other 
processors. It isn’t that we don’t like shared-update protocols, it’s just that such protocols are really hard to verify 
and hard to retrofit to a processor pipeline that was built for a simpler model. 

But we do want to share I-stream data among the caches. So, we implement a SHARED state in the cache. 
Blocks in the SHARED state can’t be written. They only get into the shared state as the result of an I-stream L1 
cache miss. 
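
The rules above amount to a very small state model. The sketch below is illustrative only: the `State` enum mirrors the EX, SH, and IN encodings used by TAG_Update, while `fill_state` and `can_write` are hypothetical helper names, and the model covers only fills of non-resident blocks.

```python
from enum import Enum


class State(Enum):
    IN = "invalid"
    EX = "exclusive"
    SH = "shared"


def fill_state(is_istream_miss: bool) -> State:
    """State given to a non-resident block when it is filled into the L2."""
    # Only an I-stream L1 miss brings a block in as SHARED; D-stream reads
    # of non-resident blocks are registered as EXCLUSIVE in the master tags.
    return State.SH if is_istream_miss else State.EX


def can_write(state: State) -> bool:
    """Exclusive-writer model: only an EXCLUSIVE block may be written."""
    return state is State.EX


print(fill_state(True), can_write(fill_state(True)))
print(fill_state(False), can_write(fill_state(False)))
```

There is deliberately no transition that lets a write update copies held by other processors, which is the “shared-update” behavior the design chooses not to support.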


COH Action | PX Action
ORC_Reg(PX, A, Tx). TAG_Update(PX, A, W, SH). |
DATA(PX,Tx,Di). ORC_Rel(Tx). | Receive data from bus, write into L2 and L1 ICache. Set state to SHARED.


Table 7.15: I-Stream Read to a Non Resident Block 
Cycle | PX Action | COH Action | Notes
1 | CMD(RDSV,COHn,Tx,A,W) | |
2 | | TAG_Check(A) - no hit found. WBC_Check(A) - no hit found. ORC_Check(A) - no hit found. Send A to DDR controller. Av = TAG_Victim(A, W). |
3 | | ORC_Reg(PX, A, Tx). WBC_Reg(PX, Av, Tx). TAG_Update(PX, A, W, SH). |
M | DATA(COHn,Tx,Dw) or CMD(WBCANCEL,Tx) | Aw = WBC_GetAddr(Tx). Send Aw along with the data Dw to the DDR controller. WBC_Rel(Tx). | This is the victim writeback. M may occur as early as cycle 3.
 | | DATA(PX,Tx,Di) | DDR returns data.
 | Receive data from bus, write into L2 and L1 ICache. Set state to SHARED. | ORC_Rel(Tx) |
PX Action | COH Action | DMA/PCI Action | Notes
CMD(RDS,COHn,Tx,A,W) | | |
 | TAG_Check(A) - no hit found. WBC_Check(A) - no hit found. Ty = ORC_Check(A) - HIT! Send A to DDR controller. Av = TAG_Victim(A, W). | | This can only happen if a block is uncached and then fetched by the DMA/PCI widget via a BRD operation.
 | Shoot down address A in DDR. ORC_Reg(PX, A, Tx). ORC_Dep(Ty, PX, A, Tx). TAG_Update(PX, A, W, SH). | | Reads are serviced “in order” even when it “doesn’t matter.”
 | DATA(PY, Ty, D), or | CMD(PRBDONE, COHn, Ty, addr=0) | Data is returned by DDR or DMA/PCI completes a probe to get the data. Either way, COH finds out about it.
 | PX, Tx, A = ORC_Rel(Ty). Send A to DDR controller. | | Note contention between this source of addresses and the incoming cmd/addr stream.
 | DATA(PX,Tx,D). ORC_Rel(Tx). | | DDR returns data.
Receive data from bus, write into L2 and L1 ICache. Set state to SHARED. | | |
PX Action | COH Action | Notes
CMD(RDS,COHn,Tx,A,W) | |
 | TAG_Check(A) - no hit found. Ty = WBC_Check(A) - HIT! ORC_Check(A) - no hit. Send A to DDR controller. Av = TAG_Victim(A, W). | We’re queued up behind a BWT or a victim writeback.
 | Shoot down address A in DDR. ORC_Reg(PX, A, Tx). WBC_Dep(Ty, PX, A, Tx). TAG_Update(PX, A, W, SH). |
 | DATA(PY, Ty, Dw) | Data is returned by DDR or DMA/PCI completes a probe to get the data. Either way, COH finds out about it.
 | PX, Tx, A = WBC_Rel(Ty). Send A to DDR controller. | Note contention between this source of addresses and the incoming cmd/addr stream.
 | DATA(PX,Tx,D). ORC_Rel(Tx). |
Receive data from bus, write into L2 and L1 ICache. Set state to SHARED. | |
7.10.4.4 I-stream Read to a Cached Block


If an I-stream miss finds the object L2 block in the EXCLUSIVE state, we face something of a problem. If the 
block is DIRTY, then we need to write the bits in the block back to memory before changing the state of the block 
to SHARED. (If we don’t write the bits to DRAM, and the only copies of this dirty data are in the SHARED state, 
then the bits may be lost. SHARED blocks can be evicted without being written back.) So, we need to ensure 
two things. First, that the current owner flushes any dirty data in the block out to main memory. Second, that A 
eventually arrives at the requesting processor. We do this with a special writeback operation. When PY flushes its 
data to the coherence widget, the COH will look up the write address, as it always does, in the ORC and WBC. It 
will find a hit in the ORC. Normally writes don’t hit in the ORC, as there are ownership issues at stake here. This 
write, however, looks like a block write to a cache block that is owned exclusively (except that the current “owner” 
hasn’t seen the data yet.) So we’ll leverage the machinery we have sitting around for block writes from cacheless 
widgets, as described in Sections 7.10.4.8 and 7.10.4.9. 
PX Action | COH Action | PY Action | Notes
 | WBC_Check(A) - no hit found. ORC_Check(A) - no hit found. Send A to DDR controller. Av = TAG_Victim(A, W). | | should be in the SHARED state.
 | CMD(PRBSHR, PY, Tx, ORIGIN=Px). Shoot down read of A in DDR controller. ORC_Reg(PX,A,Tx). TAG_Update(PX, A, W, SH). | | Send a probe/intervention to PY, asking for block A to be stored in the SHARED state.
 | | TAG_Check(A) - If no hit, see Table 7.25. | If A does hit in PY’s L2, the state should be SHARED. If not, see Table 7.22.
Receive data from the bus, write to L2 and L1 ICache. Set state to SHARED. | | |
CMD(PRBDONE,COHn,Tx, addr=0) | ORC_Rel(Tx) | |

Cycle labels in this table, in order: 1, 2, 3, L, L+1, L+2, L+3, L+4.
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COW Action 
CMD(RDSV,COHn,Tx,A,W)


L+ 
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DATA(COHn,Tx,Dw) or 
CMD(WBCANCEL,Tx) 


Receive data from the bus, write 
to L2 and L1 ICache. Set state to 
SHARED. 


TAG_Check(A) - Find at least 
one match, pick PY. 
WBC_Check(A) - no hit found. 
ORC_Check(A) - no hit found. 
Send A to DDR controller 

Av = TAG_Victim(A, W) 
CMD(PRBSHR, PY, Tx, 
ORIGIN=Px) 

Shoot down read of A in DDR 
controller. 

ORC_Reg(PX, A, Tx) 
WBC_Reg(PX, Av, Tx) 
TAG_Update(PX, A, W, SH) 


Aw = WBC_GetAddr(Tx) send 
Aw along with the data Dw to the 
DDR controller. 
WBC_Rel(PX,Av,Tx) 


Lookup A in L2 tags. 


see Table 7.25. 


If no hit, 


If there is more than one hit in the 
L2 master tags, then all blocks 
should be in the SHARED state. 


Send a probe/intervention to PY, 
asking for block A to be stored in 
the SHARED state. 


If A does hit in PY’s L2, the state 
should be SHARED. If not, we’ve 
got a problem. 


DATA(PX,Tx,D) | Send data to processor X


CMD(PRBDONE,COHn,Tx,addr=0)
ee LORCRATPRC A SY Rom ORO] 
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COW Action PZ and PY Action 
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CMDRDSCOmIsAW) [SSS 


| 4 Set eee 


one match, pick PY. 
WBC_Check(A) - no hit found. 
CMD(PRBSHR, PY, Tx, 
ORIGIN=Px) 

Receive data from the bus, write 

to L2 and L1 ICache. Set state to 


Tz = ORC_Check(A) - HIT! 


Send A to DDR controller 
Shoot down read of A in DDR 
controller. 
ORC_Reg(PX, A, Tx) 
ORC_Dep(Tz, Px, A, Tx) 
TAG_Update(PX, A, W, SH) 
0 
a ORCAS 


DATA(Pz, Tz, D) or 


1 
2 


Px, Tx, A = ORC_Rel(Tz) 


If there is more than one hit in the 
L2 master tags, then all blocks 
should be in the SHARED state. 


PZ: CMD(PRBDONE, COHn,
Tz, addr=0) 


Register dependency of Tx on Tz. 


One way or the other, Tz com- 
pletes — either by getting data 
directly from the DDR or for- 
a from —ewrewrrw— 


PY: DATA(Px, Tx, D) PY Returns data to PX. 


_—— i 
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COW Ketion 


1 


om 
+ 
iw) 


+ 
a 


E 


+ 
— 


CMD(RDS,COHn,Tx,A,W) 


Receive data from the bus, write 
to L2 and L1 ICache. Set state to 
SHARED. 


TAG_Check(A) - Find exactly 
one match for PY in EXCLU- 
SIVE state. Save matching way 
in Wy. 

WBC_Check(A) - no hit found. 
ORC_Check(A) - no hit found. 
Send A to DDR controller. 
CMD(PRBSHR, PY, Tx, 
ORIGIN=Px) 

Shoot down read of A in DDR 
controller. 

ORC_Reg(PX, A, Tx) 
TAG_Update(PX, A, W, SH) 
TAG_Update(PY, A, Wy, SH) 


WBC_Reg(Py, Ay, Ty) 
WBC_Dep(Ty, Px, A, Tx, RDS) 


Data arrives at COH. Send Dw 
with Ay to DDR controller. 
WBC_Rel(Ty) 
DATA(Px,Tx,Dw) 
ORC_Rel(PX, A, Tx) 


Lookup A in L2 tags. If no hit, 
see Table 7.25. Probe the L1 
blocks and commit L1 updates to 
the L2 copy. 
CMD(WRSTRANS, COHn, Ty, 
Ay, Origin=Tx) 

Set the state of the L2 copy to 
SHARED. 


This can happen after a I-stream 
page has been written by the OS 
or a virus. It would be humiliat- 
ing to get the wrong answer while 
executing a virus. 


Send a probe/intervention to PY, 
asking it to invalidate the block. 
PY will see that the block is 
EXCL, flush its writes, and will 
send the data to the COH even if 
it is clean. Both PY and PX will 
keep the block in SHARED state. 
If A does hit in PY’s L2, the state 
should be EXCLUSIVE. If not, 
we’ve got a problem. 


Send a writeback and transfer 
command to COH. Note that 
we write the data to memory 
whether it is clean or dirty. It 
just isn’t worth optimizing for 
this case. 

Find the “first” outstanding 
ORC entry — that’s the one 
that we need to chain on this 
WRSTRANS 


Send data to the coherence wid- 
get. Could occur in the same cy- 
cle as L+3. 

(This is what we’d do for a RAW 
hazard. See Section 7.10.4.17.) 
Coherence controller forwards 
read data from DDR. 
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COW Action 


CMD(RDSV,COHn,Tx,A,W)


DATA(COHn,Tx,Dv) or 


CMD(WBCANCEL,Tx) 


TAG_Check(A) - Find exactly one match 
for PY in EXCLUSIVE state, way Wy. 
WBC_Check(A) - no hit found. 
ORC_Check(A) - no hit found. 

Send A to DDR controller 

Send Av (address of victimized block) to 
WBC. 

CMD(PRBSHR, PY, Tx, A, ORIGIN=Px)

Shoot down read of A in DDR controller.
ORC_Reg(PX, A, Tx)

WBC_Reg(PX, Av, Tx)
TAG_Update(PX, A, W, SH)
TAG_Update(PY, A, Wy, SH)


Av = WBC_GetAddr(Tx). Send Av 
along to the DDR controller (along with 
the data) 

WBC_Rel(PX, Av, Tx) 


This can happen after a I-stream 
page has been written by the OS 
or a virus. It would be humiliat- 
ing to get the wrong answer while 
executing a virus. 


Send a probe/intervention to PY, 
asking for block A to be stored in 
the SHARED state and for PY to 
EVICT the block. 


Cycle M may occur as early as 
cycle 3. This activity may run 
in parallel with other parts of the 
transaction 
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PX Action 


Receive data from the bus, 
write to L2 and L1 ICache.
Set state to SHARED. 


COH Action


ORC_CheckS(Tx) to find Ax, Tx. 
Forward this indication to WBC.


WBC_Reg(PY, Ay, Ty) 
WBC_Dep(Ty, PX, Ax, Tx, RD) 


Data arrives at COH. Send Dw with Ay 
to DDR controller. 
WBC_Rel(Ty) 


DATA(Px,Tx,Dw) 
ORC_Rel(PX, A, Tx) 


Lookup A in L2 tags. If no hit, 
see Table 7.25. Probe the L1
blocks and commit L1 updates to 
the L2 copy. 
CMD(WRSTRANS, COHn, Ty, 
Ay, Origin=Tx) 

Set the state of the L2 copy to 
SHARED. 


If A does hit in PY’s L2, the state 
should be EXCLUSIVE. If not, 
we’ve got a problem. 


Send a writeback and transfer 
command to COH. Note that 
we write the data to memory 
whether it is clean or dirty. It 
just isn’t worth optimizing for 
this case. 

In this case ORC_CheckS(A) will
match against the first ORC entry
that was independent of any
other ORC or WBC entry.


Send data to the coherence wid- 
get. This could be as early as 
L+2. 

Enqueue Read operation for 
A,Tx to DDR controller. (This is 
what we’d do for a RAW hazard. 
See Section 7.10.4.17.) 
Coherence controller forwards
read data from DDR.
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PX |}PX Action —————sSY 


| COH Action | PY Action


Lookup A in L2 tags and find | This is the continuation of the op-
NOHIT. erations in Tables 7.22 and 7.23. 


CMD(RDSR,COHn,Tx,A,W) 


L+ 


4 


Process PRBNOHIT, lookup Tx 
and find target address and way. 
Retry the read operation. 


to L2 and L1 ICache. Set state to 
SHARED. 


This is a retry, don’t do any tag 
matching or ORC/WBC lookups, 


as we don’t really care. (And 
we've already got an ORC regis- 
tered for this read.) 

Send A on to DDR controller. 


Read data arrives at COH from 
DDR. 

ORC_Rel(PX,A,Tx) 
DATA(Px,Tx,D) 


CMD(PRBNOHIT,PX,Tx, Send a nohit notification back to 
processor X. Address and Way 


are irrelevant. 
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COW Action 
CMD(RDS,COHn,Tx,A,W)


po 

TAG_Check(A) - Find exactly This can happen after a I-stream 

one match for PY in EXCLU- page has been written by the OS 

SIVE state. Save matching way or a virus. It would be humiliat- 

in Wy. ing to get the wrong answer while 

WBC_Check(A) - no hit found. executing a virus. 

Py, Ty = ORC_Check(A) HIT. We got an ORC hit because the 

Send A to DDR controller. EXCLUSIVE owner hasn’t yet 
received the block. We need to 
delay sending the PRB until the 
block arrives. 


Shoot down read of A in DDR Update the blocks to SHARED, 
controller. that’s what they’ll be once we’re 
ORC_Reg(PX, A, Tx) done. 

ORC_Dep(Ty, Px, A, Tx, RDS) 

TAG_Update(PX, A, W, SH) 

TAG_Update(PY, A, Wy, SH) 


CMD(PRBDONE, COHn, Py, | PY finally gets the block it re- 

Find the dependent read opera- 
eee 
CMD(PRBSHR, PY, A, Tx, Ask PY to send data to PX and 


Continue at step L in Table 7.22 


Px, A, Tx = ORC_Rel(Ty) 
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COW Action DEV Action 


CMD(RDS,COHn,Tx,A,W)


TAG_Check(A) - Find exactly
one match for PY in EXCLUSIVE
state. Save matching way
in Wy.
ORC_Check(A) - no hit found.
DEV, Tv = WBC_Check(A) - HIT.
Send A to DDR controller.


Shoot down read of A in DDR
controller.
ORC_Reg(PX, A, Tx)
WBC_Dep(Tv, Px, A, Tx, RDS)
TAG_Update(PX, A, W, SH)
TAG_Update(PY, A, Wy, SH)


CMD(BWTDONE, COHn, Ty, 
addr=0, ORIGIN=Py) 


Px, A, Tx, Py = WBC_Rel(Tv) 


ORIGIN 


) 


=Px 


Continue at step L in Table 7.22 


This can happen after a I-stream 
page has been written by the OS 
or a virus. It would be humiliat- 
ing to get the wrong answer while 
executing a virus. 

We got a WBC hit because the 
EXCLUSIVE block is being up- 
dated by a BWT instruction from 
a DMA/PCI widget. 


PY finally gets the block it re-
Find the dependent read opera- 
Ask PY to send data to PX and 
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7.10.4.5 D-stream Read to a Cached Block in SHARED State 


A D-stream read to a block in the SHARED state is a surprise. That is, this is a hint that some process has 
decided to treat someone’s I-stream as data. We need to get this right, but we don’t need to make this fast. In any 
case, this isn’t rocket science. The trick here is that we need to make the coherence engine send out invalidates to 
each of the processors that might have or be acquiring copies of the I-stream data. Note that victimizing a SHARED
block in an L2 does not invalidate the L1 I-cache copy. It is the responsibility of the operating system to see that 
L1 I-caches remain coherent to the extent it is required. In practical terms, this means that the OS must flush the 
I-cache when it modifies the I-stream of a process. 

Our general approach here is that the COH will send out a broadcast PRBINV command to all caches, directing 
them to INVALIDATE the target block in their L2 caches. If any processor (other than the requestor) finds the 
L2 block in the EXCLUSIVE state, we’re in trouble and we should signal a machine check. 
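The broadcast-invalidate rule above can be written down as a short sketch. The function name, the `MachineCheck` exception, and the dictionary of per-processor states are hypothetical; the sketch also assumes that every processor, including the requester, answers the broadcast with INVDONE, which is one reading of the tables that follow.

```python
class MachineCheck(Exception):
    """Raised when a block expected to be SHARED is found EXCLUSIVE."""


def dstream_read_of_shared_block(l2_states, requester):
    """Model a D-stream read (RDEX) of a block currently held SHARED.

    l2_states maps processor name -> state of block A in that L2.
    Returns (new_states, invdone_senders).
    """
    new_states = {}
    invdone_senders = []
    for proc, state in l2_states.items():
        if proc != requester and state == "EXCLUSIVE":
            # A SHARED block must not have an EXCLUSIVE owner elsewhere.
            raise MachineCheck(f"{proc} holds the block EXCLUSIVE")
        # PRBINV is broadcast, so each processor replies with INVDONE
        # whether or not it actually held a copy.
        invdone_senders.append(proc)
        new_states[proc] = "INVALID"
    # The requester ends up as the sole owner of a clean copy.
    new_states[requester] = "EXCLUSIVE-CLEAN"
    return new_states, invdone_senders
```

In this model the transaction ID would only be freed once `invdone_senders` has been fully collected, mirroring the note that INVDONE must arrive from each processor.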
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COW Action Other PY Action 
1 | CMD(RDEX, COHn, Tx, A, W) 


2 


7 


M4 
M4 


+1 
+2 


Receive data from the bus. 
Store it in L2 and set state to 
EXCLUSIVE-CLEAN. 


TAG_Check(A) (All matches are
in the SHARED state.) 
WBC_Check(A) — Always misses. 
ORC_Check(A) — if a hit is found, 
see Table 7.29. 

Send address A to DDR con- 
troller and queue for DDR read 
operation. 

ORC_Reg(PX, A, Tx) 
TAG_Update(Px, A, W, EX) 
For all matching PY: 
TAG_Update(PY, A, Wy, INV) 
CMD(PRBINV, BROADCAST, 
Tx, A) 


Data Dr returns from DDR. 
ORC_Rel(Tx) 
DATA(X, Tx, Dr) 


Lookup A in L2. Set any 
matching blocks to “INVALID”. If 
any matching blocks are EXCLU- 
SIVE, signal a machine check. 
All processors send 
CMD(INVDONE, COHx, Tx, 
A). 


If one or more of the matching 
blocks is not in the SHARED 
state, then we should signal a ma- 
chine check. 

Since the block is SHARED, 
there can be no write transactions 
outstanding. 


All PYs are told to invalidate this 
block in their caches if necessary. 


INVDONE must be received at 
the COH from each processor in 
order to free up the TID Tx. 
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COW Action Other PY Action 


= 


M+1 


CMD(RDEX, COHn, Tx, A, W)


Receive data from the bus. 
Store it in L2 and set state to 
EXCLUSIVE-CLEAN. 


TAG_Check(A) (All matches are
in the SHARED state.)
WBC_Check(A) - No hit possible.

Py,Ty = ORC_Check(A) 

Send address A to DDR con- 
troller and queue for DDR read 
operation. 

Shootdown address in DDR. 
ORC_Reg(PX, A, Tx) 
ORC_Dep(Ty, Px, A, Tx,
RDEX)
TAG_Update(Px, A, W, EX)
For all matching PY:
TAG_Update(PY, A, Wy, INV)


? 


Px,A,Tx,Opx, 
ORC_Rel(Ty) 
CMD(PRBINV, BROADCAST, 
Tx, A) 

Launch address A to DDR 
Data returns from DDR 
ORC_Rel(Tx) 

DATA(X, Tx, Dr) 


Opy 


Data returns from DDR OR PRBDONE arrives from PY 


All processors send
CMD(INVDONE, COHx, Tx,
A).


If one or more of the matching 
blocks is not in the SHARED 
state, then we should signal a ma- 
chine check. 

Note that now there may be two 
addresses in flight 


All PYs are told to invalidate this 
block in their caches if necessary. 


Yes, we did two fetches. That’s 
how this works. Otherwise the 
ORC entries retire out of order. 
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DATA(COHn, Tx, Dw) or 
CMD(WBCANCEL, Tx) 


Receive data from the bus. 
Store it in L2 and set state to 
EXCLUSIVE-CLEAN. 


TAG_Check(A) (All matches are 
in the SHARED state.) 
WBC_Check(A) — nohit. 
ORC_Check(A) — no hit. 

Send address A to DDR con- 
troller and queue for DDR read 
operation. 

Av = TAG_Victim(A, W) 
ORC_Reg(Px, A, Tx) 
WBC_Reg(PX, Av, Tx) 
TAG_Update(PX, A, W, EX) 
For all matching PY: 
TAG_Update(PY, A, Wy, INV) 
CMD(PRBINV, BROADCAST,
Tx, A)


Av = WBC_GetAddr(Tx) 

Send Av along with data Dw to 
the DDR controller write queue. 
WBC_Rel(PX, Av, Tx) 


Lookup A in L2. Set any
matching blocks to “INVALID”. If
any matching blocks are EXCLUSIVE,
signal a machine check.
All processors send
CMD(INVDONE, COHx, Tx,
A).


COH Action | Other PY Action


If one or more of the matching 
blocks is not in the SHARED 
state, then we should signal a ma- 
chine check. 


All PYs are told to invalidate this 
block in their caches if necessary. 
BROADCAST is a special Target 
vector that ensures this command 
arrives at every port’s command 
queue. (Note that we don’t limit 
the broadcast to processors that 
have the data.) 


W may occur as early as cycle 3. 
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7.10.4.6 D-Stream Write Miss 


D-Stream writes from a processor that miss in its L2 cache require that the L2 segment acquire ownership of 
the relevant block before the write can complete. Thus, there really isn’t a notion of a “D-Stream Write Miss” as 
L2 write misses become D-Stream Read Miss events described in Sections 7.10.4.1, 7.10.4.2, and 7.10.4.5.


7.10.4.7 D-Stream Write to Invalidate 


A processor may flush a block from its L2 segment without asking for a refill. In this case, the processor will 
issue a WINV command as shown in Table 7.31. 

If the block to be flushed is clean, then there is no need to send data. In this case, the processor will issue a 
FLUSH command as shown in Table 7.32. Note that the FLUSH operation is not implemented in the 
ICE9 chip. None of the nodes in the chip uses the FLUSH operation. 
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DATA(COHn,Tx,Dw) —  write- 
back victim block 


TAG_Check(A) - Hits on PX. 


TAG_Update(PX, A, W, IN) 
WBC_Reg(PX, A, Tx) 


Aw = WBC_GetAddr(Tx) 

Send Aw along with the data Dw 
to the DDR controller. 
WBC_Rel(Tx) 


COH Action


CMD(WINV,COHn, Tx,A,W) 


7 
[a 


W is the way that we’ll invalidate. 
This must match the comparison 
that will happen for A in the mas- 
ter tags. 

If there is no tag hit then we’ve 
passed a read operation on this 
block. PX will return a PRBNO- 
HIT to the other ship. So PX 
needs to write the data back to 
DDR. 


Cycle M may occur as early as cy- 
cle 3. This activity may run in 
parallel with other parts of this 
transaction. 
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OBSOLETE THIS OPERATION IS NOT IMPLEMENTED IN THE ICE9 V1.0 CHIP ———_- 


COH Action


CMD(FLUSH,COHn, Tx,A,W) 


W is the way that we’ll invalidate. 
But we already know that from 
the comparison that will happen 
for A in the master tags. 


2 TAG_Check(A) - Hits on PX. If there is no tag hit then we’ve 
passed a read operation on this 
block. PX will return a PRBNO- 
HIT to the other ship. 


3 TAG_Update(PX, A, W, IN) | Don’t tell anybody else.
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7.10.4.8 Block Write to a Non Resident Block 


As opposed to D-stream Write misses from a processor, I/O and the DMA engine (which we'll also call an I/O 
device, even though it isn’t) may write entire blocks of memory. In this case, we know that all 64 bytes are being 
written, so there is no need to perform a read of the block and merge in just the changed bytes before the writeback. 

On the other hand, we really really want to optimize the path that carries data from a packet buffer in the 
DMA engine to a processor that will consume it. For that reason, we distinguish block writes that are performed 
by a cacheless device like the DMA engine from those performed by a processor. The “trick” that we’re about to
employ here would not be appropriate for processors, as the three-stage writeback (a relatively frequent operation) 
would be a bottleneck for the processors, as they’re only allowed one read and one write transaction outstanding 
at any given time. 

So, Tables 7.33 and 7.34 show how a cacheless node on the CSW performs block writes to non resident data. 


| Device Action | COH Action | Comment


CMD(BWT,COHn,Ty,A) | Block write from device
“DEV” 


TAG_Check(A) - no hit
found.
WBC_Check(A) — If there is 


A hit in the WBC is likely 
the result of a victimization 
write from some processor, 


a hit here, see Table 7.34. or — less likely — a colliding 
ORC_Check(A) — If there is | write from the DMA engine 
a hit, see Table 7.35. or the PCI widget. 
A hit in the ORC is the re- 
sult of an outstanding BRD. 


WBC_Reg(DEV, A, Tv) Tell the device to complete 


CMD(BWTGO, DEV, Tv, | its write operation 
A) 


Device receives BWTGO
command, matches
against outstanding data 
block to be written. 
DATA(COHn,Tv,Dw) — 
send write data block to the 
coherence widget 


Receive incoming write 
data. 

A = WBC_GetAddr(Tv) 
Send matching A address to 
DDR controller along with 
the data. 

WBC_Rel(DEV, A, Tv) 


Table 7.33: Block Write to a Non Resident Block 
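The handshake in Table 7.33 can be sketched as a two-step exchange between the device and the coherence widget. This is only an illustrative model under simplifying assumptions: `CoherenceModel` and its methods are hypothetical names, the WBC is reduced to a dictionary keyed by transaction ID, and the sketch covers only the case where the tag, WBC, and ORC checks all miss.

```python
BLOCK_BYTES = 64


class CoherenceModel:
    """Hypothetical model of the COH write buffer (WBC) and DRAM."""

    def __init__(self):
        self.wbc = {}    # TID -> (device, address)
        self.dram = {}   # address -> 64-byte block

    def bwt(self, device, tid, addr):
        """Handle CMD(BWT): register the write, then tell the device to go."""
        if tid in self.wbc:
            raise ValueError("TID already outstanding")
        self.wbc[tid] = (device, addr)          # WBC_Reg(DEV, A, Tv)
        return ("BWTGO", device, tid, addr)     # CMD(BWTGO, DEV, Tv, A)

    def data(self, tid, block):
        """Handle DATA(COHn, Tv, Dw): recover the address and write DRAM."""
        if len(block) != BLOCK_BYTES:
            raise ValueError("a block write carries all 64 bytes")
        device, addr = self.wbc.pop(tid)        # WBC_GetAddr + WBC_Rel
        self.dram[addr] = bytes(block)
        return addr
```

Note that the data phase carries only the transaction ID. The address is recovered from the WBC entry, which is why the entry must stay registered until the data arrives.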
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Kosai COW Action [ Other Device Action | 


CMD(BWT, COHn,
Tv, A)
TAG_Check(A) - no hit
found.
Ty, Py = WBC_Check(A) -
We find a HIT
ORC_Check(A) - There can
be no ORC hit.
WBC_Reg(DEV, A, Tv)
WBC_Dep(Ty, DEV, A, Tv,
BWT)


WBC_Rel(Ty)

This causes the dependent 
write from DEV to be acti- 
vated. 

Remove the entry for Ty 
from the WBC. 

CMD(BWTGO,Dev,Tv,A)


PY has evicted the block, or
another device has launched 
a write to this block. 


We'll write the data di-
rectly to the DDRAM after
the victimization write com-
pletes.
DATA(COH,Ty,Dy) | Other device writes its data
to the DDR.


Continue as at cycle W in Ta-
ble 7.33. 


Table 7.34: Block Write to a Non Resident Block with a Writeback in Flight from Processor Y 


| Other Device Action | Device Action 


COH Action


TAG_Check(A) - no hit
WBC_Check(A) - no hit.
Ty, Py = ORC_Check(A)
WBC_Reg(DEV, A, Tv)
ORC_Dep(Ty, DEV, A, Tv,
BWT)

data returns from DDR


This causes the dependent
write from DEV to be acti-
vated.

Remove the entry for Ty
from the ORC.
CMD(BWTGO,PX,Tv,A) 


BRD Co from the 
other device. 


We'll write the data di- 
rectly to the DDRAM after 
the victimization write com- 
pletes. 


Other device reads its data 
from the DDR. 


Continue as at cycle K in
Table 7.36. (If the other 
device is a processor PY, 
then we would have seen 
a TAG_Check hit on PY. 
We didn’t, so we send a 
BWTGO to PX since the 
data is not currently cached 
by anybody.) 


CMD( ee ap 
(paces cieastiacial 7 Ty, addr= 


Table 7.35: Block Write to a Non Resident Block with a Read in Flight from Processor Y 
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7.10.4.9 Block Write to a Cached Block 


We decided that close integration between the fabric hardware and the processors is really important. We can 
gain a whole lot of performance over I/O based strategies if we provide a quick path for the DMA engine to return 
data back to a processor without requiring extra external memory traffic. 

For example, consider the “traditional way” that we might implement part of a packet receive operation. You 
might imagine that the DMA engine would pull the packet off the fabric and write it to DDR memory. Since 
we have an exclusive/noshare cache coherence protocol, when the DMA engine wrote the data to memory it also 
invalidated any cached copy of the data. So if processor 0 (P0) does a lot of MPI_RECV operations to the same
destination buffer, P0 will have to fetch the received data from memory every time. That could add up to 80 ns of
overhead for every MPI_RECV operation. But that is the way an I/O based strategy would do this.

On the other hand, the DMA engine is pretty close to the L2 cache segments. So we’re not going to invalidate 
the cached copy of the data unless we have to. In Section 7.10.4.8 (Table 7.33) we described how the DMA engine could do a block
write to a non resident block. Table 7.36 shows how this same transaction works when the data is already resident 
in processor PY’s cache. The transaction here assumes that the block is found in the EXCLUSIVE state, and not 
in the SHARED state. Section 7.10.4.10 describes the transaction flow for the latter operation. Note there are 
several “bad things” that can happen on the way to completing this operation. In most other transaction tables 
(above) I’ve left out the unpleasant paths, deferring discussion until later sections on hazards. I don’t do that here,
because these hazards are central to the way the transaction works. In particular, note that we never trust to
chance in the success of a retry. If a transaction encounters some condition that causes it to restart, we ensure that 
no other transaction could intervene so as to prevent successful completion. (That’s one of the powerful benefits of 
the chained dependence lists that are maintained in the ORC and WBC structures: once a transaction is registered 
in the ORC or WBC, it will complete before later dependent operations in the ORC or WBC complete or even 
attempt to use more L2 switch resources.) 
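The chained dependence lists mentioned above can be illustrated with a toy table. Everything here is a hypothetical sketch: the `DependencyTable` class and its methods only loosely correspond to the `*_Check`, `*_Reg`, `*_Dep`, and `*_Rel` operations in the transaction tables, and chaining a new transaction behind the most recently registered entry for the same address is an assumption made to keep the example short.

```python
class DependencyTable:
    """Hypothetical stand-in for the ORC or WBC dependence lists."""

    def __init__(self):
        self._addr = {}      # TID -> block address (insertion ordered)
        self._waiters = {}   # TID -> TIDs chained directly behind it

    def register(self, tid, addr):
        """Record an outstanding transaction (compare ORC_Reg / WBC_Reg)."""
        self._addr[tid] = addr
        self._waiters[tid] = []

    def check(self, addr):
        """Return the newest outstanding TID for addr, or None on a miss."""
        hit = None
        for tid, a in self._addr.items():
            if a == addr:
                hit = tid
        return hit

    def chain(self, addr, tid):
        """Register tid; if addr is already outstanding, queue tid behind it."""
        parent = self.check(addr)
        self.register(tid, addr)
        if parent is not None:
            self._waiters[parent].append(tid)   # compare ORC_Dep / WBC_Dep
        return parent

    def release(self, tid):
        """Retire tid and return the transactions it wakes (compare *_Rel)."""
        del self._addr[tid]
        return self._waiters.pop(tid)
```

Because each release hands back exactly the transactions that were waiting on it, a retried operation is woken by its predecessor rather than racing with unrelated traffic for the same block.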

Nonetheless, it is possible that a block write could encounter an ORC hit or WBC hit that causes it to retry, 
only to find out that the processor holding the block has since evicted it. In this case, the retried operation is 
guaranteed to complete successfully.

All block write transactions carry a “HalfMask” field in the data half of the transaction. This allows the DMA 
engine to write 32-byte, naturally aligned half-blocks to a cached block. The HalfMask of a BWT transaction may select
all 64 bytes, the first 32 bytes of a block, or the last 32 bytes. (See Figures 7.9, 7.10 and 7.11.)
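As a rough illustration of what a half-block write does to a cached block, the sketch below merges incoming data into an existing 64-byte block. The two-bit encoding used here (bit 0 selects the first 32 bytes, bit 1 the last 32 bytes) is an assumption made for the example; the actual HalfMask encoding is defined in the figures cited above, and `apply_half_mask` is a hypothetical helper.

```python
BLOCK_BYTES = 64
HALF_BYTES = 32


def apply_half_mask(cached, incoming, half_mask):
    """Merge a block write into a cached block under an assumed 2-bit mask.

    half_mask: 0b01 = first half only, 0b10 = last half only, 0b11 = both.
    """
    if len(cached) != BLOCK_BYTES or len(incoming) != BLOCK_BYTES:
        raise ValueError("blocks must be 64 bytes")
    if half_mask not in (0b01, 0b10, 0b11):
        raise ValueError("mask must select at least one half")
    merged = bytearray(cached)
    if half_mask & 0b01:
        merged[:HALF_BYTES] = incoming[:HALF_BYTES]
    if half_mask & 0b10:
        merged[HALF_BYTES:] = incoming[HALF_BYTES:]
    return bytes(merged)
```

A write with only one half selected leaves the other 32 bytes of the cached block untouched, which is what lets the DMA engine update part of a resident block without a read-merge-write cycle through memory.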
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CMD(BWT,COMn, 
Tv,A) 




COH Action


TAG_Check(A) — Find a 
match against PY, way W. 
WBC_Check(A) — If there is 
a hit here, see Table 7.39 and 
7.38. 

ORC_Check(A) — If there is 
a hit here, see Table 7.40. 


CMD(PRBBWT,PY,Tv,A) 
WBC_Reg(DEV, A, Tv) 


PY receives forwarded
Block Write
command. If A 
does not hit in the 
L2, see Table 7.41. 
Otherwise, invali- 
date appropriate L1 
blocks. Record BWT 


in progress. 


A WBC hit implies that 
there is a colliding vic- 
tim write or block write 
in progress. We need to 
make sure the writes are se- 
quenced in order. 

An ORC hit implies a read- 
in-progress and that PY 
hasn’t yet acquired the data, 
though it has been assigned 
ownership for block A. 


PY could evict a block with- 
out informing the COH, or 
this could be a case of “ships 
passing in the night.” 


CMD(BWTGO,Dev,Tv,A)


Transaction is continued in Table 7.37. 


Table 7.36: Block Write to EXCLUSIVE Cached Data 


COH Action


Receive BWTGO,
match Tv against 
outstanding write. 
DATA(PY,Tv,Dw) — 
send write data to 
processor Y. 


PY receives data,
writes it to L2, removes
Tv from list
of BWTs in progress 
CMD(BWTDONE, 
COHn, Tv, addr=0) 


Note that BWTDONE is 
sent to coherence engine, not 
to originating device. 

There may be depen- 
dent writes — see Section 
7.10.4.19. 


WBC_Rel(DEV,A, Tv) 


Table 7.37: Block Write to EXCLUSIVE Cached Data (continued from Table 7.36.) 
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Py, W = TAG_Check(A) — 

Find a match.

Ty, Py = WBC_Check(A) — We 
find a HIT. 

ORC_Check(A) — There can be 
no ORC hit. 


WBC_Dep(Ty, DEV, A, Tv, 
BWT) 
WBC_Reg(DEV, A, Tv) 


COH Action


A WBC hit implies that there 
is a colliding victim write or 
block write in progress. We need 
to make sure the writes are se- 
quenced in order. 

The WBC hit could be against a 
processor’s outstanding write, or 
the PCI widget, or another trans- 
action from this device! In this 
case, we’ll consider writes from 
PY. For collisions with the PCI 
or DMA engine, see Table 7.39. 
PY is recorded as the target, as it 
matched in the L2 cache lookup. 


PY writes its data to the DDR. 
This is probably a hint that we’re 
going to find that the block has 
been evicted from the L2 in PY, 
but we don’t know that yet. 


Continue as at cycle K in Table 
7.36. 
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COW ation Other Device Action 


CMD(BWT, COHn, Tv, A)


Py, W = TAG_Check(A) — 
Find a match. 

Pw, Tw =
WBC_Check(A) - Find a
hit.

ORC_Check(A) — There 
can be no ORC hit. 


WBC_Dep(Tw, DEV, A, 
Tv, BWT) 
WBC_Reg(DEV, A, Tv) 


WBC_Rel(Aw) - See
that Tv is a dependent operation.


case L2_NORD_WT: 


addr=0, 
=OTHER) 


A WBC hit implies that 
there is a colliding vic- 
tim write or block write 
in progress. We need to 
make sure the writes are 
sequenced in order. 

The WBC hit could be 
against a processor’s out- 
standing write, or the PCI 
widget, or another trans- 
action from this device! In 
this case, we’ll consider 
writes from the PCI wid- 
get or the DMA as the 
“other device”. For a col- 
lision with a write from a 
processor, see Table 7.38. 
PY’s write will wake this 
write up when it com- 
pletes. 

The writer registered in 
the WBC completes its 
write. 


This completes the write 
from the “other” device. 


Continue as at cycle K in
Table 7.36. 
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TAG_Check(A) ig An ORC hit implies a read-in- 

match/nomatch. progress and that PY hasn’t yet 

WBC_Check(A) — no hit. acquired the data, though it has 

Py, Ty = ORC_Check(A) — hit been assigned ownership for block 

on access from Py, DMA, or PCI A. Since we got a tag match, 

(we'll call it PY for example.) we should queue up behind the 
RD transaction, since that’s the 
owner. Otherwise, we should just 
launch the write, since the read is 
by a cacheless widget. 

WBC_Reg(DEV, A, Tv) 

ORC_Dep(Ty, DEV, A, Tv,


DDR aw DATA for Ty OR CMD i aa COHn, Ty, | PY completes its operation and 
ORC_Rel(PY, A, Ty) 
which causes the COH to: 
CMD(BWTGO,Dev,Tv,A) 
Receive BWTGO, match Tv
against outstanding write.


DATA(PY,Tv,Dw) - send write
data to processor Y. 


WBC_Rel(DEV,A,Tv)
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COW Action 


CMD(BWT,COHn,Tv,A)


a ihitchs 

WBC_Check(A) — If there is a hit 
here, see Table 7.39 
ORC_Check(A) — If there is a hit 
here, see Table 7.40. 


CMD(PRBBWT,PY,Tv,A) 
WBC_Reg(DEV, A, Tv)


Receive BWTNOHIT, match Tv
DATA(COH,Tv,Dw) - send write
data to coherence widget, since
PY doesn’t care.


A = WBC_GetAddr(Tv) 

Send Dw and A to DDR con- 
troller for write to DDRAM. 
WBC_Rel(DEV,A,Tv) 


PY receives forwarded Block
Write command. A does NOT hit
in the L2 cache.


A WBC hit implies that there
is a colliding victim write or
block write in progress. We need
to make sure the writes are se-
quenced in order.

An ORC hit implies a read-in-
progress and that PY hasn’t yet
acquired the data, though it has
been assigned ownership for block
A.


PY could evict a block without
informing the COH, or this could
be a case of “ships passing in the
night.”


CMD(BWTNOHIT,Dev,Tv,addr=0) | Tell the device to continue the
write to the coherence engine. 
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Device Action 


CMD(BWT,COHn, Tv,A) 


Receive PRBINV, match Tv
against outstanding write. 


DATA(COH,Tv,Dw) — send write 
data to coherence widget. 


COH Action 


TAG_Check(A) — Find a match 
against one or more blocks in the 
SHARED state. 

WBC_Check(A) — There can’t be 
a hit in the WBC. 
ORC_Check(A) - If there is a hit
here, see Table 7.43. 


CMD(PRBINV,BROADCAST,Tv,A) | Invalidate all blocks


Foreach PY matching in
TAG_Check
TAG_Update(Py, A, W, INV)
WBC_Reg(DEV, A, Tv)


A = WBC_GetAddr(Tv) 

Send Dw and A to DDR con- 
troller for write to DDRAM. 
WBC_Rel(DEV,A,Tv) 


Transaction is sent from device 
“DEV” 


A WBC hit implies that there is 
a colliding victim write or block 
write in progress. That is incon- 
sistent with the state of the mas- 
ter tags. 

An ORC hit implies a read-in- 
progress and that PY hasn’t yet 
acquired the data, though it has 
been assigned ownership for block 
A 

in the 
SHARED state. 


Yup, this is an odd use of 
PRBINV. But note that any 
PRBINV that matches the TID 
for the device’s BWT, must be 
the result of the BWT. 

All processors send 
CMD(INVDONE, COHx, Tx, 
A). 
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Sale BWT,COHn, Tv,A) 


Receive PRBINV, match Tv
against outstanding write. 


DATA(COH,Tv,Dw) — send write 
data to coherence widget.


TAG_Check(A) - Find a match
against one or more blocks in the 
SHARED state. 
WBC_Check(A) — No hit. 

Py, Ty = ORC_Check(A) — Find 
a hit. 

ORC_Dep(Ty, DEV, A, Tv, 
BWT) 
Foreach PY matching in
TAG_Check
TAG_Update(Py, A, W, INV)
WBC_Reg(DEV, A, Tv)

DDR returns data for transaction
Ty


ORC_Rel(Ty) See that (DEV, A, 
Tv) is a dependent operation. 
CMD(PRBINV, BROADCAST,


A = WBC_GetAddr(Tv) 

Send Dw and A to DDR con- 
troller for write to DDRAM. 
WBC_Rel(DEV,A,Tv) 


processors 


CMD(INVDONE, COHx, 


Transaction is sent from device 
“DEV” 


An ORC hit implies a read-in- 
progress and that PY hasn’t yet 
acquired the data, though it has 
been assigned ownership for block 


A. 


We'll activate this BWT when 
PY completes its read operation. 
We invalidate the block for the 
“read in flight” since all future 
reads will queue up behind our 
entry in the WBC. 

Either COH sees the DDR return 
data for TID = Ty or PY sends a 
PRBDONE to the coherence wid- 
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7.10.4.11 Block Write and Other Probe Collisions with Victimization 


It is possible that a block write is forwarded to an L2 segment, acknowledged by the segment with a BWTGO 
command, and then arrives only to find that the target block has been displaced. We could prevent this by locking 
any block that is the target of a PRBBWT until the data side of the transaction completes. Unfortunately, that
smells like a good way to create a deadlock. In fact, this is a problem for probes in general. 

The Coherence engine will, of course, detect this when the victimization writeback address matches against the 
BWT operation in the WBC. But that doesn’t help, as the COH has no control over the L2 segment’s completion 
of the victim writeback. The L2 is hell bent for leather on its way to writing the data to DRAM and there’s nothing 
that’s going to stop it. (Note that the victimization writeback arrived AFTER the BWT operation was forwarded 
from the COH, otherwise we’d have held off the continuation of the BWT operation.) 

There are lots of ways of handling this, most of them pretty complicated. Since BWT operations are relatively 
infrequent, and complete quickly, this is what we’ll do: (Note that this approach applies to all PROBE operations 
directed at a processor segment.) 

The L2 segment will hold off all L1 to L2 read transactions from the processor once it starts processing any 
kind of probe operation from the CSW. Since only a read operation can cause a victimization, and processors don’t 
execute WINVs, this ensures that a WINV or RDV/RDSV (writeback or victimization of a block) that is initiated 
before the segment begins processing a PRBBWT (or PRBBRD, PRBSHR, PRBWIN, or PRBINV) has completed 
before the decision is made to send BWTGO or BWTNOHIT. Further, no new L1 to L2 read transactions are
permitted until either a BWTNOHIT, PRBNOHIT, BWTDONE, or PRBDONE has been sent. For more detail,
see the state machine descriptions of probe handling in the processor segment (Section 6.22).


THIS TABLE HAS BEEN REMOVED. 
Table 7.44: Block Write Collides with Victimization of Target Block 

Cycle | DMA Action | COH Action | Notes
 | CMD(BRD,COHn,Tx,A,W) | |
2 | | TAG_Check(A) - no hit (or shared). WBC_Check(A) - no hit found. ORC_Check(A) - no hit found. Send A to DDR controller. | If there is a WBC hit, see Table 7.47. If there is an ORC hit, see Table 7.46.
 | | ORC_Reg(DEV, Tx, A) |
L+1 | | ORC_Rel(Tx) | Return data from DDR to requester. Note that the ORC wakeup will forward any request to PY rather than the DMA widget, since the DMA has no cache.

DMA Action COH Action
CMD(BRD,COHn,Tx,A,W)


TAG_Check(A) - no hit. (or | This is sequencing against an 
shared) RDS or another BRD from device 
WBC_Check(A) - no hit found. Pv — otherwise we’d be cached 
Tv = ORC_Check(A) - HIT. EXCLUSIVE. 

Send A to DDR controller 


Shootdown A in DDR controller. 
ORC_Reg(DEV, Tx, A) 
ORC_Dep(Tv, DEV, Tx, A) 


DATA(Pv,Tv,D) Return data from DDR to origi- 
ee ee ee 
ae ORT) 
px [J Sen 20 DDR] Continue at stop Lin Table TA 

DMA Action COH Action

TAG_Check(A) - no hit. This is sequencing against a vic- 
Tv = WBC_Check(A) - HIT. tim writeback or a BWT from de- 
ORC_Check(A) - no hit. vice Pv. 
Send A to DDR controller 


CMD(BRD,COHn,Tx,A,W)


Shootdown A in DDR controller. 
ORC_Reg(DEV, Tx, A) 
WBC_Dep(Tv, DEV, Tx, A) 


CSR ee DATA(COMn, Tv, D Data from writer to Dain Rom water f@ DRAM 
Ne [ET A= WCRI) 
Se [_ rr tap TT Ta 

7.10.4.13 Block Read to a Cached Block


If the DMA or PCI widget reads a block that is currently in an L2 cache entry, we’ll leave it in the L2 cache. The 
processor segment that currently owns the block will flush its L1 updates (if necessary) to the L2 block and send a 
copy of the block to the DMA/PCI widget. The state of the cache block will not change. Table 7.48 describes the 
operation when the read completes after being forwarded to the owner. Table 7.51 shows the sequence when the 
block is no longer valid by the time the forwarded request arrives. 

DMA Action COH Action
CMD(BRD,COHn,Tx,A,W)


TAG_Check(A) - Hit on PY in 
EXCLUSIVE state. 
WBC_Check(A) - no hit found. 
ORC_Check(A) - no hit found. 
Send A to DDR controller 


CMD(PRBBRD,PY,Tx,A) 
Shoot down read of A in DDR 
controller. 


ORC_Reg(DMA, A, Tx) 


If the block is SHARED, see Ta- 
ble 7.45. 

If there is a WBC hit, see Table 
7.49. 

If there is an ORC hit, see Table 
7.50. 

Send a probe/intervention to PY, 
asking for block A to be for- 
warded to DMA. 




TAG_Check(A) - If no hit, see Ta- | If A does hit in PY’s L2, the state 
ble 7.51. should be EXCLUSIVE. If not, 
Flush L1 dirty to L2 block. we’ve got a problem. 


a 62 DATA(PX,Tx,D) Send data to processor X 


Accept data. 


CMD(PRBDONE,COHn,Tx,addr=0)


Note that the ORC wakeup will 
forward any request to PY rather 
than the DMA widget, since the 
DMA has no cache. 

DMA Action COH Action


CMD(BRD,COHn,Tx,A,W) 




TAG_Check(A) - Hit on PY in 
EXCLUSIVE state. 

Tv = WBC_Check(A) - HIT! 
ORC_Check(A) - no hit found. 
Send A to DDR controller 

Shoot down read of A in DDR 
controller. 

ORC_Reg(DMA, A, Tx) 
WBC_Dep(Tv, DEV, Tx, A) 


If the block is SHARED, see Ta- 


Wait on completion of write from 


DATA(Py, Tv, D) Either way, the write completes. 
OR 

CMD(BWTDONE, 

DEV, Tv, addr=0) 


DMA, 
WBC_Rel( a 
CMD( SaNDAD: PY, Tx, A) | Continue with step L in Table 

7.48. 

DMA Action COH Action


| 1 | 


CMD(BRD,COHn,Tx,A,W) 


Shoot down read of A in DDR Wait on completion of write from 
controller. Pv. 

ORC_Reg(DMA, A, Tx) 

ORC_Dep(Tv, DEV, ae A) 


DATA(Py, Tv, D) O CMD( Sao COHn, DEV, | Either way, the write completes. 
[teat Pe 
DMA, ' A, 


7.48. 


En ec | STS TY) 
TAG_Check(A) - Hit on PY in If the block is SHARED, see Ta- 
EXCLUSIVE state. ble 7.45. 

WBC_Check(A) - no hit. 
Tv = ORC_Check(A) - HIT! 
Send A to DDR controller 

DMA Action COH Action


CMD(BRD,COHn,Tx,A,W)
TAG_Check(A) - Hit on PY in If the block is SHARED, see Ta- 
EXCLUSIVE state. ble 7.45. 
WBC_Check(A) - no hit found. 
ORC_Check(A) - no hit found. 
Send A to DDR controller 
Av = TAG_Victim(A, W) 


CMD(PRBBRD,PY,Tx,A) Send a probe/intervention to PY, 
Shoot down read of A in DDR asking for block A to be for- 
controller. warded to DMA. 
ORC_Reg(DMA, A, Tx) 

TAG_Check(A) — no hit. 


_. CMD( aS DEV, Tx, | Senda NOHIT to the Tee 
addr=0) 


Ignore tag comparisons and all 
CAM ops. 
Send A to DDR controller. 


a 
Data arrives from : 
ORC_Rel(Tx) 
DATA(DMA, Tx, D) 

7.10.4.14 Read from an I/O Location


This is pretty much what you think it might be.¹ Assume for instance that processor X wants to read register
R on processor segment Y. Table 7.52 shows the transaction flow.


Requester (X) Action | Target Device Action | Notes
CMD(RDIO,DEV,Tx,A) | | Processor (or PCI/DMA) sends an IO read request to DEV.
 | Match A against registers for this node. Fetch register data. |
 | DATA(X, Tx, D) | Send data back to requestor. Note that this is just one 64 bit word. All the other 7 doublewords in this transfer are set to zero.
Capture incoming data. | |


Table 7.52: I/O Register Read


¹Surprise!

7.10.4.15 Write to an I/O Location


It turns out that this is more interesting than you might imagine. For a variety of reasons, we’ve decided that
data will never arrive at a processor port unless it has been requested by the processor.² So, a write of an I/O
register in a processor segment requires that we ask the processor segment to READ some data and load it into
the target register!

For example, let’s say that processor X wants to write data value D to register R in processor Y. Table 7.53 
shows how this will happen. 


Requester (X) Action | Target Device Action | Notes
CMD(WTIO, DEV, Tx, A). Write DATA and BYTEMASK to WTIOREG. | | Processor (or PCI/DMA) sends an IO write request to the target device, DEV.
 | Enqueue a RDIO request: CMD(RDIO, X, WTIOREG). Store A in the WTIOADDR register for this node. | Note that a node can have just one outstanding RDIO or WTIO transaction at a time, so a single WTIOADDR register is enough here.
DATA(DEV, Tx, DATA, BYTEMASK) | |
 | Receive DATA and BYTEMASK. Apply both to write the target stored in WTIOADDR register. |


Table 7.53: I/O Register Write
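
The pull-style write in Table 7.53 can be sketched in a few lines. This is only an illustration of the handshake, not the hardware: the class and function names are made up, and the BYTEMASK bit ordering (bit i enables byte i of the 64 bit word) is an assumption.

```python
class TargetDevice:
    """Hypothetical model of the device that owns the I/O register."""

    def __init__(self):
        self.registers = {}    # register address -> 64-bit value
        self.wtioaddr = None   # WTIOADDR: address of the pending write

    def on_wtio(self, addr):
        # Remember where the data will go, then ask the writer for it.
        self.wtioaddr = addr
        return ("RDIO", "WTIOREG")

    def on_data(self, data, bytemask):
        # Merge only the bytes enabled in BYTEMASK (bit i enables byte i;
        # this bit ordering is an assumption).
        value = self.registers.get(self.wtioaddr, 0)
        for byte in range(8):
            if bytemask & (1 << byte):
                lane = 0xFF << (8 * byte)
                value = (value & ~lane) | (data & lane)
        self.registers[self.wtioaddr] = value
        self.wtioaddr = None


def io_write(target, addr, data, bytemask):
    """Writer side: park DATA/BYTEMASK in WTIOREG, then let the target pull."""
    wtioreg = (data, bytemask)
    request = target.on_wtio(addr)          # CMD(WTIO, DEV, Tx, A)
    if request == ("RDIO", "WTIOREG"):      # CMD(RDIO, X, WTIOREG)
        target.on_data(*wtioreg)            # DATA(DEV, Tx, DATA, BYTEMASK)
```

The point of the sketch is that data only moves in response to a request from the receiving side, which is the rule stated above.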


²(This avoids a whole lot of queueing and buffering and flow-control/backpressure machinery that we could probably get right, but
only with more effort than it would be worth.) 

7.10.4.16 Read after Read Hazard


Imagine the sequence Read(X,A) followed by Read(Y,A). In this case processor Y’s request should be forwarded 
to X so that we “do the right thing” relative to block ownership and the state of the block. 

That’s why we have the “Outstanding Read CAM” or ORC. The ORC is indexed by an address or a TID. Each 
entry contains the TID of a subordinate read and the low bits of the subordinate read address. It is important to 
note that a transaction will not hit on an ORC entry if some previous transaction has already hit on that entry. (This
allows us to build a “linked list” of subordinate operations on the ORC. The WBC works in the same way.) 

When Read(X,A) arrives it is sent directly to the DDR controller. If A matches a tag in the master tags we 
shoot the transaction down in the DDR controller. (See section 7.10.4.2.) If A matches a tag in the ORC, we shoot 
it down in the DDR controller. 

In each case, we allocate an entry K in the ORC (it is large enough to accommodate all 14 possible outstanding
read operations) for Read(X,A) and record the address and TID. 

When Read(Y,A) arrives, we find that it hits in the ORC against entry K. Again we allocate an entry for 
Read(Y,A) (call it J) and write the TID for Read(Y,A) and low bits of the address into entry K. We also shoot 
down the Read(Y,A) operation in the DDR controller. 

When the DDR controller returns the data for Read(X,A) it also returns the TID for that operation. This TID 
will hit on entry K. We then read the TID for Read(Y,A) and the low bits of the address from entry K. This is 
packaged up into an appropriate PROBE request which the coherence controller sends to processor X. When X 
sends the response data to Y, Y will send a PROBE DONE command back to the coherence controller. This will
hit against entry J which will then cause a further probe to be sent out if some other processor Z has subsequently 
hit on entry J with a dependent read. 

In any case, the arrival of a ProbeDone or returned read data from the DDR will cause the appropriate entry 
in the ORC to be marked invalid. 
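
The chaining described above can be sketched as a small model. This is a toy under stated assumptions, not the CAM itself: the class and method names are invented, and the 64-byte block size is assumed for the sake of splitting an address into a block and its low bits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrcEntry:
    tid: str                        # TID of the read that owns this entry
    block: int                      # block address (A without its low bits)
    dep_tid: Optional[str] = None   # TID of the one subordinate read, if any
    dep_low_bits: int = 0           # low address bits of the subordinate read


class Orc:
    BLOCK_SHIFT = 6                 # assumed 64-byte blocks

    def __init__(self):
        self.entries = {}           # TID -> OrcEntry (insertion ordered)

    def lookup(self, addr):
        """Hit only on an entry for this block that has no dependent yet."""
        block = addr >> self.BLOCK_SHIFT
        for entry in self.entries.values():
            if entry.block == block and entry.dep_tid is None:
                return entry
        return None

    def register(self, tid, addr):
        """Allocate an entry for a read; chain it behind the entry it hits."""
        parent = self.lookup(addr)
        if parent is not None:
            parent.dep_tid = tid
            parent.dep_low_bits = addr & ((1 << self.BLOCK_SHIFT) - 1)
        self.entries[tid] = OrcEntry(tid, addr >> self.BLOCK_SHIFT)
        return parent.tid if parent else None

    def release(self, tid):
        """Retire an entry; return its dependent (TID, low bits) or None."""
        entry = self.entries.pop(tid)
        if entry.dep_tid is None:
            return None
        return (entry.dep_tid, entry.dep_low_bits)
```

Because a lookup skips any entry that already has a dependent, three reads to the same block form a chain X, then Y, then Z, and each release wakes exactly one successor.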

Isn’t that among the slicker things that you’ve seen? Tables 7.54, 7.55, 7.56, and 7.57 show what happens when 
the ORC entry is released for a cached read operation that has completed. 

Unfortunately, read-after-read hazards where the DMA engine or the PCI widget originates the first of the two 
reads (the read that is depended upon) are a little bit stickier. The BRD operation implies that the data is headed
for a non-cached user. So we can’t send the PRBWIN or the PRBSHR to the DMA/PCI widget the way we did 
with reads that depended on other processor reads. There are a whole bunch of cases to consider. 


Cycle | COH Action | Notes
 | Px, A, Op = ORC_Rel(Ty) | COH completes operation for PY and finds dependent operation (RDEX,RDV) for device PX.
 | CMD(PRBWIN, PY, Tx, A) | Tell PY to give up the block. Continue at Table 7.10 or Table 7.12 at step L.


Table 7.54: Read After Read Hazard ORC Release for RDEX, or RDV following RDEX, or RDV


Cycle | COH Action | Notes
 | Px, A, Op = ORC_Rel(Ty) | COH completes operation for PY and finds dependent operation (RDEX,RDV) for device PX.
 | CMD(PRBINV, BROADCAST, Tx, A). Send A to DDR Controller. | Tell PY to give up the block. Continue at Table 7.28 at step M.


Table 7.55: Read After Read Hazard ORC Release for RDEX, or RDV following RDS, or RDSV

Cycle | COH Action | Notes
1 | Px, A, Op = ORC_Rel(Ty) | COH completes operation for PY and finds dependent operation (RDS,RDSV) for device PX.
 | CMD(PRBSHR, PY, Tx, A, ORIGIN=Px) | Tell PY to give up the block. Continue at Table 7.22 at step L.


Table 7.56: Read After Read Hazard ORC Release for RDS, or RDSV following RDEX, RDV, RDS, or RDSV


Cycle | COH Action | Notes
 | DEV, A, ORC_Rel(Ty) | COH completes operation for PY and finds dependent operation BRD for device DEV.
 | CMD(PRBBRD, PY, Tx, A) | Tell PY to supply the block. Continue at Table 7.48 at step L.


Table 7.57: Read After Read Hazard ORC Release for BRD following RDEX, RDV, RDS, or RDSV


Cycle | COH Action | Notes
1 | Px, A, Tx, Op, Owner, State = ORC_Rel(Ty) | COH completes operation for DMA/PCI and finds dependent operation (RDEX, RDV, RDS, or RDSV) for device PX. ORC lookup returns current block owner and current state. In this case, there is no owner.
 | Send A, Tx, Px to DDR controller. | Queue up a DDR transaction on behalf of Px. Continue at step N in the normal flow for the dependent operation on a non-cached block.


Table 7.58: Read After Read Hazard ORC Release for RDEX, RDV, RDS, or RDSV following BRD to an UNCACHED Block


Cycle | COH Action | Notes
 | Px, A, Tx, Op, Owner, State = ORC_Rel(Ty) | COH completes operation for DMA/PCI and finds dependent operation (RDEX,RDV) for device PX. ORC lookup returns current block owner and current state. The current state is EX, the owner is Py.
2 | CMD(PRBWIN, Owner, Tx, A) | Continue operation as at step L in Table 7.10.


Table 7.59: Read After Read Hazard ORC Release for RDEX, or RDV following BRD to an EXCLUSIVE Block

Cycle | COH Action | Notes
 | Px, A, Op, Owner, State = ORC_Rel(Ty) | COH completes operation for DMA/PCI and finds dependent operation (RDS,RDSV) for device PX. ORC lookup returns current block owner and current state. The current state is EX, the owner is Py.
 | CMD(PRBSHR, Owner, Tx, A, ORIGIN=Px) | Continue operation as at step L in Table 7.22.


Table 7.60: Read After Read Hazard ORC Release for RDS, or RDSV following BRD to an EXCLUSIVE Block


Cycle | COH Action | Notes
 | Px, A, Tx, Op, Owner, State = ORC_Rel(Ty) | COH completes operation for DMA/PCI and finds dependent operation (RDEX,RDV) for device PX. ORC lookup returns current block owner and current state. The current state is SH, Py is chosen to respond.
 | CMD(PRBINV, BROADCAST, Tx, A). Send Px, A, Tx to DDR controller. | Continue operation as at step M in Table 7.28.


Table 7.61: Read After Read Hazard ORC Release for RDEX, or RDV following BRD to a SHARED Block


Cycle | COH Action | Notes
 | Px, A, Op, Owner, State = ORC_Rel(Ty) | COH completes operation for DMA/PCI and finds dependent operation (RDS,RDSV) for device PX. ORC lookup returns current block owner and current state. In this case, there is no owner. The current state is SH, Py is chosen to respond.
2 | CMD(PRBSHR, Owner, Tx, A, ORIGIN=Px) | Continue operation as at step L in Table 7.22.


Table 7.62: Read After Read Hazard ORC Release for RDS, or RDSV following BRD to a SHARED Block

7.10.4.17 Read after Write Hazard


It is possible that processor X will attempt to read block A just as it is in the process of being evicted from 
processor Y. There are three possible alignment cases. 

First, Read(X,A) arrives at the coherence controller BEFORE Write(Y,A) has arrived. In this case the READ 
will find that the block is OWNED by Y and the coherence widget will send a PROBE request to Y. Y will complete 
the write operation with WriteData(Y,A,D). Y will then respond to the probe request that was forwarded on behalf 
of X with a PRBNOHIT to X. X will re-issue the Read(X,A) command which will arrive at the coherence controller 
as a read against a block that is non-resident.³

In the second case, Write(Y,A) arrives, followed by Read(X,A), followed by WriteData(Y,A,D). This is what 
the WBC (WriteBackCAM) is for. When the COH receives Write(Y,A) it registers the write in the WBC and sets 
the L2 tag for processor Y to INVALID. The WBC is indexed by the address A and a TID field. Each entry in the
WBC contains the address of the write command, the TID for the write command (which is the alternate key),
a valid bit, a dependent read TID, and the low bits of a dependent read address. (We need to account for the
fact that the address A in Read(X,A) may not be the same as A in Write(Y,A), but refers to the same cache block.)
When Read(X,A) arrives, the A matches the address tag in the WBC entry. The TID for Read(X,A) and the low 
bits of A are recorded in the entry. At the same time, the coherence controller has already sent A on to the DDR 
controller. The match against an outstanding write causes the COH to send a Read-after-write shootdown signal 
to the DDR controller to clobber the read in progress. Later on, when WriteData(Y,A,D) arrives, the TID from 
this transaction will be matched against the secondary key in the WBC. The WBC will send the ADDRESS for the 
write operation on to the DDR controller (so it will know where to write this incoming data) and sends the read 
address for Read(X,A) and the TID to the RaW queue in the address path. This read operation is later sent on to 
the DDR controller when time permits. The key here is that the read operation will arrive at the DDR controller 
AFTER the write data. 

In the last case, Write(Y,A) and WriteData(Y,A,D) both arrive before Read(X,A). In this case, the Read will 
be processed as a normal read against non resident data. When WriteData(Y,A,D) arrives, the valid bit for the 
matching entry in the WBC is cleared. 

Finally, note that if a Read(X,A) matches an address in the WBC, but the entry already has a recorded 
dependent read operation, then we consider that the access has MISSED in the WBC. In fact, the operation should 
have HIT in the ORC since the presence of a read operation in the dependent read field of a WBC entry implies 
that the read operation has not yet completed. 
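
The second alignment case (write command, then read, then write data) can be sketched as a toy event model. The names here are invented for illustration, the 64-byte block size is an assumption, and the ORC is deliberately left out, so a read that finds the dependent slot already taken simply falls through instead of chaining in the ORC as the text above says it should.

```python
class WbcModel:
    """Toy model of WBC ordering for a read that arrives before write data."""

    BLOCK_SHIFT = 6  # assumed 64-byte blocks

    def __init__(self):
        self.wbc = {}        # write TID -> {"block", "addr", "dep"}
        self.raw_queue = []  # reads waiting behind a write (the RaW queue)
        self.ddr_log = []    # order in which operations reach the DDR

    def write_cmd(self, tid, addr):
        # Write(Y,A): register the writer in the WBC.
        block = addr >> self.BLOCK_SHIFT
        self.wbc[tid] = {"block": block, "addr": addr, "dep": None}

    def read_cmd(self, tid, addr):
        # Read(X,A): a WBC hit shoots the read down and records it as the
        # single dependent of the matching write.
        block = addr >> self.BLOCK_SHIFT
        for entry in self.wbc.values():
            if entry["block"] == block and entry["dep"] is None:
                entry["dep"] = (tid, addr)
                return "shootdown"
        self.ddr_log.append(("read", tid, addr))
        return "sent"

    def write_data(self, tid):
        # WriteData(Y,A,D): the write goes to the DDR, then any dependent
        # read is moved to the RaW queue.
        entry = self.wbc.pop(tid)
        self.ddr_log.append(("write", tid, entry["addr"]))
        if entry["dep"] is not None:
            self.raw_queue.append(entry["dep"])

    def drain_raw_queue(self):
        # "When time permits": replay the queued reads to the DDR.
        while self.raw_queue:
            tid, addr = self.raw_queue.pop(0)
            self.ddr_log.append(("read", tid, addr))
```

Running Write(Y,A), Read(X,A), WriteData(Y,A,D) through this model leaves the write ahead of the read in the DDR log, which is the whole point of the mechanism.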


COH Action | Notes
Px, A, Op = WBC_Rel(Ty) | COH completes writeback operation for Py and finds dependent RDEX, RDV, RDS, or RDSV operation for Px.
Send A, Tx, Px to DDR controller | Queue up a DDR transaction for Px. Continue at step N in the normal flow for the dependent operation.


Table 7.63: Read After Write Hazard WBC Release for BRD, RDEX, RDV, RDS, or RDSV following BWT, WINV, RDV, or RDSV


³Note that we don’t forward data from processor Y to X in this case, as the logic and sequencing to avoid the many race opportunities
isn’t really worth the bother, given this particular sequence should not occur often. Otherwise we need to add extra address comparator 
machinery in the writeback buffer and all kinds of other junk. The heck with it. 




7.10.4.18 Write After Read Hazards 


Imagine the sequence Read(X,A), Write(Y,A)... Which version of the data should X receive? The answer is
that it doesn’t matter. The only case that matters is Read(X,A), Write(Y,A), Read(Z,A). In this case Z can see
the same data as X (both see old data, both see new data) or Z can see newer data than X. But time must not 
apparently flow backward. We easily handle this as all DDR read transactions to the same bank are processed in 
order. Further, we know that the WBC will ensure that Read(Z,A) happens AFTER WriteData(Y,A). We also 
know that Read(X,A) arrived before Read(Z,A) and that Read(X,A) will be processed before Read(Z,A) because 
of the ordering rules in the address datapath. (Incoming commands on the address path always take priority over
entries in the RaW queue.) 

Because of our EXCLUSIVE ownership protocol, there really are only a few opportunities for a WAR hazard. 

First, Read(X,A) arrives at the COH just before a victimization writeback command from another processor Y. 
In this case, X’s read will be forwarded to Y and will encounter a NOHIT condition, since reads never hit against 
victimized blocks. (Note that this simplifies things a bit in the L2 segment design.) When X’s read encounters the 
NOHIT, it will be resent to the coherence controller where it will be turned into a DDRAM read. This will either 
hit in the WBC, in which case the read will be sent to the DDR after the write has completed, or it will miss in 
the WBC and be sent to the DDR controller and serviced after the write has completed. 

In the second case, Read(X,A) arrives at the COH after the victimization writeback (or block write) command 
from another processor Y but before the data has arrived. That’s what the WBC is for. Read(X,A) will hit against 
processor Y’s write back CAM entry and be enqueued. When Y delivers the writeback data, the WBC entry for 
Y’s transaction will be checked and the subordinate read for processor X will be launched. 

In the third case, Read(X,A) arrives at the COH just before a block write command. In this case, the COH will 
not forward the block write command, as it will HIT against X’s ORC entry. This case is covered in Table 7.40. 


7.10.4.19 Write After Write Hazards 


Because of the EXCLUSIVE ownership scheme that we’ve adopted, write-after-write hazards can only be caused 
by a block-write followed by another block-write or a processor’s eviction of a block. 

In the case of a block-write BW2 following a block-write BW1, the COH will register BW2 as dependent on 
the BW1 (by updating BW1’s entry in the WBC) and refrain from sending out the PRBBWT for BW2 until the 
COH receives a BWTDONE for BW1. 

In the case of a block-write BW1 followed by an eviction writeback VIC, the eviction writeback data must be 
ignored by the COH unless we can be sure that the evicting processor had a chance to “see” the block write data 
before it evicted the cache block. We know that, because of the list of outstanding BWT operations in each L2 (See
Section 7.10.4.11), incoming BWT data will be reflected by a L2 that has evicted the target block and sent back to 
the appropriate COH. A victimization that occurs before the BWTGO command is sent out will result in a NOHIT 
condition in the L2 and is described in Table 7.41. A victimization that occurs after the BWTDONE command 
is sent out will be processed correctly, as there will be no hit in the WBC at the coherence widget. The “ships 
passing in the night case” where the BWTPROBE arrives after the RDV/RDSV command has been sent from the 
processor is covered by the CohWbc module that kills the eviction write if it hits against a BWT in progress.

7.10.5 Interrupt Delivery


The ICE9 Chip doesn’t have a central interrupt controller. Instead, we deliver interrupt requests via central 
switch COMMAND cycles. Each interrupting device is responsible for figuring out which processor should field 
an interrupt request. When a device needs to signal an interrupt to processor X, for example, it will send an INT
command with a reason code to X. The reason code is an eight bit number and an index into the processor’s 
interrupt cause register set. 

Interrupts from units that cannot issue CSW commands are delivered by the Slow Interrupts mechanism. See 
the Slow Interrupts registers in this chapter (section 7.18.8), and see the “Interrupts, Again” section of the Processor 
Segments chapter, (section 6.19.6). 


Action | Notes
CMD(INT, PX, CswTid::INT, Reason) | Reason<11:0> is driven on the low bits of the Address bus. All other bits are 0. All interrupts use a constant for the TID, “INT” from the CswTid table.
 | PX writes Reason<7:0> to ICR[Reason<11:8>] and asserts interrupt chosen by Reason<11:9>. Both are cleared under processor (software) control. (See Section 6.19.)


Table 7.64: Interrupt Delivery
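
The field split in Table 7.64 can be written out as a short decoder. This is a sketch that follows the 12 bit Reason layout given in the table (the prose above calls the reason code an eight bit number, so treat the wider fields as the table's reading); the function name and dictionary keys are invented for illustration.

```python
def decode_int_reason(address_bus: int) -> dict:
    """Split the Reason field of an INT command into its parts."""
    reason = address_bus & 0xFFF                 # Reason<11:0>
    return {
        "value": reason & 0xFF,                  # Reason<7:0>, written to the ICR
        "icr_index": (reason >> 8) & 0xF,        # Reason<11:8>, selects the ICR
        "int_select": (reason >> 9) & 0x7,       # Reason<11:9>, interrupt asserted
    }
```

For example, a Reason of 0xA5C gives a value of 0x5C, ICR index 0xA, and interrupt select 5. Note that the ICR index and the interrupt select overlap in bits 11:9.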


7.10.6 Special Communication Commands 


Similar to interrupt delivery, we wanted a special way of moving just a few bits from a processor to the DMA 
engine. The SPCL command handles this case. SPCL commands are single ended writes that carry all information 
(both the data and where it is supposed to go) in the Address field of the operation. It is up to the receiving node 
to “do the right thing” with the incoming operand. 

SPCL is triggered by a write to an address in the Spcl address range R_Spcl. (See Section 7.18.17.) The physical
address and the data are combined to produce a single value that is placed on the CSW address bus. Figure 7.14 
shows the layout of the SPCL address and the meaning of the individual fields in the physical address. The only 
supported destination bus stop is the DMA engine. 


Physical Address of Store (bit markers: 35, 19, 16, 15, 4, 8)
SPCL Region Code (constant 0xEBE) | Dest. Bus Stop | Addr1 | Addr2

Store Data
Data<23:0>

SPCL Encoding on CSW (bit markers: 35, 19, 16, 15, 8)
Data<23:8> | Addr1 | Data<7:0> | Addr2


Figure 7.14: SPCL Physical Address Field

DMA or Py Action | PX or DMA Action | Notes
CMD(SPCL, Px, Td, CmdOp) | | CmdOp<35:3> is driven on the Address bus.
 | Px does the right thing with the incoming CmdOp, according to the target module’s spec. |
 | CMD(DONE, dev, Td, 0) | Tell the sender that the SPCL is done.


Table 7.65: Special Commands

7.10.7 WINV, Victim Writebacks and the WriteBack CAM

Verification of the protocol encountered this rather gnarly sequence:

1. PS0 does a RDEX for 0x200
2. Sometime later DMA does a BWT to 0x200
3. Before the BWT is complete, PS0 does a WINV where 0x200 is the victim writeback address
4. PS2 does a RDEX for 0x200


So what should happen? Clearly we want the BWT write data to end up in PS2’s cache. Were it not for the 
intervening victim write in step 3 the WBC ordering machinery would just make this happen. 

But there are two problems. First, the way the protocol is written the WINV will be entered into the WBC after 
the BWT operation. In fact, it will be registered as a dependent of the BWT. But we know that such writebacks 
never stall: they’re on the way to the DDR and there’s nothing that we want to do to stop that. (PS0 will send a
BWTNOHIT to the DMA engine after the writeback is complete.) So two bad things will happen: First the WINV
will complete and trigger PS2’s RDEX. But that will happen before the DMA engine’s BWT is completed. Second,
when the BWT does complete, it will find that its dependent is a WINV or writeback. How do you restart a WINV?

A similar problem happens with RDV and RDSV operations, but in this case the BWT has no dependent 
registered to it. That is a further problem, since now we'll have TWO entries in the WBC that match the same 
address and that both think they’re the “last” such entry.

We solve this problem with a couple of rules governing WBC_lookup, WBC_dep, and WBC_reg. 

(We'll use the phrase “victim writeback” to mean a WINV or the writeback portion of a RDV or RDSV.) 

WBC_reg registers a writer. All WINVs, RDVs, and RDSVs, register their write addresses in the WBC. This 
is as it always has been. 

WBC_dep never records WINVs, or the victim writeback portion of RDVs and RDSVs as dependent on a 
previous entry in the WBC. (There is nothing to wake up.) RDVs and RDSVs may be registered as dependent 
operations based on the READ addresses. 

WBC_lookup may encounter a case where an incoming request matches TWO entries that claim to be “last” in 
the WBC. In this case, if one of the two entries is a victim writeback, then we pick the other entry as the “parent” in 
the dependence chain. If neither of the two entries is a victim writeback, then we’ve got a machine check condition. 

Finally (for now) consider the following sequence: 


1. PS0 does a RDEX for 0x200 and the read completes
2. PS1 does a RDEX for 0x200
3. PS0 does a WINV (or RDV/RDSV with 0x200 as the victim) before PS1’s probe arrives
4. PS2 does a RDEX for 0x200


In this case, the PS2 access will hit in the ORC entry for PS1’s RDEX and it will hit on the WBC entry for PS0’s 
victim writeback. 

We know that the PS1’s RDEX should complete before continuing PS2’s RDEX. This is achieved by chaining
PS2’s access to PS1’s access. But what of the victim writeback from PS0?

The answer is that, even though WBC_lookup and ORC_lookup both returned hits for PS2’s RDEX, that
operation should only be registered as a dependent on the ORC entry, not the WBC entry. In handling all cases
where ORC_Hit and WBC_Hit are both true, we behave as if WBC_Hit was false. This works because PS1’s RDEX
will arrive at PS0 and get a PRBNOHIT after PS0 has completed the victim writeback. (Note that this is a
requirement on PS0’s behavior.)

If not for the intervening read from PS1, PS2’s transaction would have missed in the ORC and hit in the WBC.
In this case, it would be chained on PS0’s WINV completion.
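
The two selection rules in this section can be sketched as a pair of small functions. This is a sketch under assumptions: the entry dictionaries, the function names, and the MachineCheck exception are invented for illustration, and the text does not say what happens when both "last" entries are victim writebacks, so the sketch treats that as a machine check too.

```python
class MachineCheck(Exception):
    """Raised when the WBC state is one the rules above do not allow."""


def pick_wbc_parent(last_entries):
    """Choose the parent among WBC entries that claim to be 'last'.

    Each entry is a dict with a 'victim_writeback' flag (True for a WINV or
    the writeback portion of an RDV/RDSV).
    """
    if not last_entries:
        return None
    if len(last_entries) == 1:
        return last_entries[0]
    if len(last_entries) > 2:
        raise MachineCheck("more than two 'last' entries")
    candidates = [e for e in last_entries if not e["victim_writeback"]]
    if len(candidates) == 1:
        return candidates[0]       # skip the victim writeback
    # Neither is a victim writeback (per the text), or both are
    # (an assumption; the text is silent on that case).
    raise MachineCheck("ambiguous 'last' entries in WBC")


def dependency_target(orc_hit, wbc_hit):
    """When both CAMs hit, behave as if the WBC had missed."""
    if orc_hit:
        return "ORC"
    if wbc_hit:
        return "WBC"
    return None
```

In the first gnarly sequence, the BWT and the WINV both look "last", and the lookup picks the BWT as PS2's parent. In the second, PS2 hits in both CAMs and is chained on the ORC entry only.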


7.11 WRSTRANS and When Bad Things Happen to Good Blocks 


WRSTRANS is used to force a transition from some D-stream readable state (EXCL, DIRTY, UPDATED) 
to SHARED. It comes into play when one processor segment X issues an RDS to address A when A is owned 
and in one of the D-stream readable states in some other processor segment Y. X will send the RDS to COHx 
(either COHE or COHO) which will detect the hit on Y’s cache and forward a PRBSHR to Y. Y will then send a
WRSTRANS to COHx along with the data from block A. (This is because we’re going to give A to X which will 
never write the block back to the DDR, so we have to do the writeback now in case the data is dirty.) 

For a whole lot of reasons, we don’t have Y send data to X directly. Instead, there is machinery in the COH 
that 


1. Remembers the last target address for any RDS or RDSV in an array that is indexed by transaction ID.

2. Matches the address of an incoming WRSTRANS against that array. (The WRSTRANS needs to use a TID from
segment Y rather than the original TID from X’s RDS request. Otherwise the writeback caused by the WRSTRANS
could be confused with a writeback from X caused by an RDSV.) When an incoming WRSTRANS address matches an
entry in the array, COHx will re-issue the read from X to the DDR so as to complete the transaction.

3. Holds off the restarted DDR read until we’re sure the data has been written to the DDR.
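
The three pieces of machinery listed above can be sketched as one small tracker. All names here are hypothetical, and one detail is an assumption that the text does not state: the sketch clears an array entry once a WRSTRANS has matched it, so that the same read is not re-issued twice.

```python
class WrstransTracker:
    """Toy model of the COH bookkeeping for WRSTRANS."""

    def __init__(self):
        self.rds_target = {}   # TID -> block address of the last RDS/RDSV
        self.waiting = {}      # block -> TIDs to re-issue after the write lands

    def record_rds(self, tid, block):
        # Item 1: remember the last RDS/RDSV target, indexed by TID.
        self.rds_target[tid] = block

    def on_wrstrans(self, block):
        # Item 2: match the WRSTRANS address against the array.
        tids = [t for t, b in self.rds_target.items() if b == block]
        for tid in tids:
            del self.rds_target[tid]   # assumption: a match consumes the entry
        if tids:
            self.waiting.setdefault(block, []).extend(tids)
        return tids

    def on_write_complete(self, block):
        # Item 3: only now may the matching reads be re-issued to the DDR.
        return self.waiting.pop(block, [])
```

A read recorded for X stays parked after the WRSTRANS match and is only handed back for re-issue when the write for that block is reported complete.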


7.12 One Thousand Ships, One Thousand Nights 


There are a bazillion possible interactions between probes from other processors/devices and outbound requests 
from a processor segment. Most have tickled one bug or another in the L2 controller or the Coherence widget. Here 
are a few of them: 


7.12.1 Read Retry vs. Victim Writebacks 


Imagine that CORE1 sends a PRBWIN A (via the COH) to CORE0 sometime after CORE0 has victimized
block A. There are two things to note here: first, CORE0 may respond with a NOHIT before its write data has
arrived at the DDR controller; second, the DDR controller does not preserve ordering of read and write commands
that arrive from the COH. It is the responsibility of the COH to ensure that no read is issued to the DDR until
after any previous writes to that location have made it to the DRAMs.

The ordering is maintained by a mechanism in the COH. When a retry read (RDEXR or RDSR) arrives at the
COH, the COH builds a list of all currently outstanding L2 writeback transactions. (That is, all transactions caused
by RDSV, RDV, WINV, but not BWT.) If the list is empty, the read retry is sent all the way to the DDR controller
without delay. If the list is not empty, the read retry request is queued until each of the transactions in the list
has been retired by the DDR controller. (The DDR controller indicates that a write has completed by asserting
the ddr_coh_WtTIDValcda signal.) Once the list is empty, the retry reads are resubmitted to the DDR controller
and will complete.
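
The hold-off rule for retry reads can be sketched as follows. The class and method names are invented, and the sketch assumes the list is a snapshot taken when the retry read arrives, so that writebacks started later do not hold the read any longer.

```python
class RetryReadGate:
    """Toy model of the COH hold-off for RDEXR/RDSR retry reads."""

    def __init__(self):
        self.outstanding = set()   # TIDs of RDSV/RDV/WINV writebacks in flight
        self.held = []             # (read TID, writeback TIDs it still waits on)
        self.sent_to_ddr = []      # retry reads released to the DDR, in order

    def writeback_started(self, tid):
        self.outstanding.add(tid)

    def retry_read(self, tid):
        # Snapshot the writebacks outstanding right now.
        waiting = set(self.outstanding)
        if waiting:
            self.held.append((tid, waiting))
        else:
            self.sent_to_ddr.append(tid)

    def writeback_retired(self, tid):
        # Called when the DDR controller signals that a write has completed.
        self.outstanding.discard(tid)
        still_held = []
        for read_tid, waiting in self.held:
            waiting.discard(tid)
            if waiting:
                still_held.append((read_tid, waiting))
            else:
                self.sent_to_ddr.append(read_tid)
        self.held = still_held
```

A retry read that arrives with two writebacks in flight waits for exactly those two, and is released as soon as the second one retires.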


7.12.2 PRBWIN A followed by RDEX A 


This problem was uncovered by the following trace: 


((TIME 2144) (FROM CORE2) (TO CORE0) (TID PS2T0) (CMD PRBWIN) (ADR #x000000038E8D4FC0) (BMASK #x00) (WAY 0))
((TIME 2152) (FROM CORE0) (TO COHO) (TID PS0T0) (CMD RDEX) (ADR #x000000038E8D4FC0) (BMASK #xFF) (WAY
((TIME 2172) (FROM PCI) (TO CORE0) (TID PS0T0) (CMD PRBNOHIT) (ADR #x000000038E8D4FC0) (BMASK #xFF) (WAY 0))
((TIME 2184) (FROM CORE0) (TO CORE2) (TID PS2T0) (CMD PRBNOHIT) (ADR #x0000000000000000) (BMASK #x00) (WAY 0))
((TIME 2188) (FROM CORE0) (TO COHO) (TID PS0T0) (CMD RDEXR) (ADR #x000000038E8D4FC0) (BMASK #xFF) (WA
(((TIME 2200) (FROM COHO) (TO CORE0) (TID PS0T0) (MOD_STATE CLEAN) (HM W64)
(DW0 #x2DB5EB453184B323) (DW1 #x237A9EF2B4695462) (DW2 #xB81F62A3E366D5D1) (DW3 #x7E3C120113FC9101
(TIME 2208) (FROM COHO) (TO CORE0) (TID PS0T0) (COMPLETE T))

Note that this trace is broken in a couple of ways.


The first part of this sequence could arise if CORE0 displaces address A (000000038E8D4FC0) between the time
CORE2’s RDEX/V arrived at the COH and the time the PRBWIN arrived at CORE0. When CORE0’s RDEX
arrives at the COH, we know that it will queue up in the ORC against CORE2’s forwarded RDEX. Therefore, the
entry at time 2172 in this trace is erroneous, as the COH will not forward the RDEX for PS0T0 until PS2T0 has
completed.

On the other hand, PS2T0 must complete, so CORE0 must send a PRBNOHIT to CORE2. We solve this
problem by noting that the tag array is not updated in the L2 until the fill data has been returned, so any probe
lookup against A will miss in CORE0’s L2. In this case, for example, CORE0 must send a NOHIT before it gets
its read data. That’s what the CacCtl probe control state machine does.

Note there are problems with the stream that I used for illustration: the PCI should never have sent the
PRBNOHIT to CORE0, as the arrival of a PRBWIN for the same address from CORE2 indicates that CORE0’s
RDEX will wait in the ORC until CORE2’s read is complete. Also, PRBNOHIT should be sent with an address
of 0.


7.12.3 PRBXXX A While A Is Being Evicted 


Imagine that CORE0 owns block A and that CORE1 wants it. CORE1 (via the COH) sends a PRBWIN A to
CORE0 just after CORE0 has sent an RDSV B where A is the victim block.

By our normal rule, CORE0 will perform a tag lookup on A and find a HIT. (Note again that L2 tags aren’t
updated until fill data returns, so the L2 still shows A as valid until B is returned.) But that, of course, is the
wrong thing to do. In this case, the CAC must notice that the probe address has hit against a victim writeback
and “do the right thing.” The actual sequence depends on the type of probe.


7.12.3.1 PRBWIN Against an Evicted Block


Imagine that CORE1 has sent a PRBWIN for A. When we send the PRBNOHIT to CORE1, CORE1 will 
respond with RDEXR to the COH. The COH read retry handler will “hold” the RDEXR request until all writebacks 
currently in flight have completed. (See Section 7.12.1.) The only requirement here is that the victim writeback must 
have been registered in the COH before the RDEXR arrives. This is satisfied if we delay sending the PRBNOHIT 
until after the data for the victim has been driven onto the CSW. PRBNOHIT responses are delayed in the CMX 
until any outstanding writeback transactions have completed. 


7.12.3.2 PRBSHR Against an Evicted Block 


In this case, CORE1 has sent a PRBSHR A to CORE0, which is victimizing A. (Well, at least this isn’t going to
be as ugly as a WRSTRANS sequence.) CORE0 must hold off sending the PRBNOHIT signal until the outstanding
victim write data has been sent. PRBNOHIT responses are delayed in the CMX until any outstanding writeback
transactions have completed.


7.12.3.3  PRBBWT Against an Evicted Block 


This one appears in BugZilla 860. Imagine the following sequence: 
• DMA DMAWT0 BWT A

• CORE0 PS0T1 RDV B, victimize A

• PCI PCIWT0 BWT A


The important thing to ensure is that the writes to memory occur in the following order: CORE0 data, DMA data,
PCI data. (Why? Because not all BWT’s write all 64 bytes.) How do we do this? Note that the WBC queuing rule
will ensure that PCIWT0 is registered as a dependent on DMAWT0. (Not on PS0T1, since a transaction arriving
at the WBC will register as a dependent on a WINV/writeback only if there are no other “last” writers to
the target address in the WBC.) So, we need to make sure that DMAWT0 doesn’t write its data to the DRAM
until after PS0T1’s data arrives at the DRAM.

The good news here is that the DDR controller preserves write ordering of blocks with the same address. So,
all we have to do is to ensure that CORE0 sends the PRBNOHIT to DMA after CORE0 has sent its data to COH.
(COH forwards all writeback data along with the writeback address to the DDR when data arrives.) We ensure
NOHIT ordering for PRBBWT responses by requiring that the writebuffer in an L2 segment is empty (or that there
are no victim writebacks in progress) before sending a PRBNOHIT. PRBNOHIT responses are delayed in the CMX
until any outstanding writeback transactions have completed.
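
The reason the write order matters (not every BWT writes all 64 bytes) can be illustrated with a short Python sketch. Everything here is hypothetical: the data values, the byte ranges each BWT touches, and the one-mask-bit-per-byte encoding are chosen only to make the effect visible and are not taken from the CSW definition.

```python
BLOCK_BYTES = 64

def apply_write(block, data, bytemask):
    """Return a new block in which byte i comes from data when mask bit i is set."""
    return bytes(
        data[i] if (bytemask >> i) & 1 else block[i]
        for i in range(BLOCK_BYTES)
    )

def final_memory(writes):
    """Apply (data, bytemask) writes in order to a zeroed block."""
    block = bytes(BLOCK_BYTES)
    for data, mask in writes:
        block = apply_write(block, data, mask)
    return block

FULL_MASK = (1 << BLOCK_BYTES) - 1
victim = (bytes([0xC0]) * BLOCK_BYTES, FULL_MASK)  # CORE0 victim writeback, whole block
dma = (bytes([0xD0]) * BLOCK_BYTES, 0x00FF)        # partial BWT, bytes 0..7 (assumed)
pci = (bytes([0xE0]) * BLOCK_BYTES, 0x0FF0)        # partial BWT, bytes 4..11 (assumed)

in_order = final_memory([victim, dma, pci])        # CORE0, then DMA, then PCI
out_of_order = final_memory([dma, victim, pci])    # victim data lands after the DMA data

print(in_order[:12].hex())
print(out_of_order[:12].hex())
```

In the out-of-order case the full-block victim write overwrites the bytes that the DMA write had just deposited, so the DMA data in bytes 0..3 is lost. If every BWT wrote the whole block, only the position of the final write would matter.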


7.12.3.4 PRBBRD Against an Evicted Block


Consider this sequence: 
• DMA->COHx DMARD0 BRD A

• CORE0->COHx PS0T1 RDV B, victimize A

• COHx->CORE0 DMARD0 PRBBRD A


In fact, the PRBBRD could come before or just after the RDV. In this case, the appropriate response is NOHIT,
but CORE0 should not send the NOHIT response until it has sent its data back to the COH/DDR. Once the
data has been driven onto the CSW, the COH will ensure that the block read retry (BRDR) from DMA will arrive
at the COH and queue up until the victim write has made it all the way to the DDR.


7.12.3.5 PRBINV Against an Evicted Block 


PRBINV commands require a response. Each processor must return INVDONE to the originating Coherence
widget once a PRBINV has been processed. Note that PRBINV commands should only arrive for blocks that are
in the SHARED state. The processor never writes back blocks in the shared state, so PRBINV A will never arrive
during an eviction of A, though it may arrive while A is being “replaced.”

We ran into a nasty protocol issue pretty late in the game. Imagine that thread X is executing Emacs and 
loads up processor 0’s L2 cache with code from Emacs. One of the blocks of Emacs code resides at address A. Now 
imagine that thread X exits and processor 0 is then used to run a new thread Y. The OS will perform an L1 ICache 
flush of A, but because we don’t communicate cache flush operations to the L2, A still resides in the SHARED 
state in processor 0’s L2. And it contains Emacs code. Imagine that Y is running Quake XVII. Thread Y sends a 
request to the PCI to page in the code for Quake XVII at location A. The PCI sends a BWT request to the COH 
which forwards a PRBINV to processor 0. But in this case, processor 0’s bus stop is really busy and the PRBINV 
gets stuck in processor 0’s probe queue or even in the incoming command queue in the CSW. Meanwhile the PCI 
finishes the BWT and sends an interrupt to processor 0. Alas, the interrupt doesn’t pass through the probe queue 
and goes directly to the interrupt register to tell thread Y to go ahead and use the code at block A, as the PCI 
thinks it is now visible. Thread Y wakes up and executes the OLD instructions in block A (from Emacs) instead 
of the NEW instructions from Quake XVII. Hilarity ensues. 

If we had to do it all over again, we’d probably use some kind of software mechanism, but at this point software 
invalidates of a page of L2 cache would be very expensive. Instead, we get some help from the protocol. 

If a COH sends a PRBINV out in the course of completing a read or write request for TID M from the PCI or
DMA, it will set a PRBINV_CTR[M] to 6 and assert TID_BUSY[M] until PRBINV_CTR[M] is zero. (It will also
hold TID_BUSY[M] true until the read/write is otherwise complete.) When an INVDONE command arrives with
TID = M, PRBINV_CTR[M] will be decremented. Thus, the PCI widget performing a BWT to our address A will
not complete the BWT operation until all processors respond with an acknowledgement. (PRBINVs sent out for
a TID R belonging to processors (as opposed to the PCI or DMA engine) cause PRBINV_CTR[R] = 5.)
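
A minimal Python sketch of the counter rule above may help. It models only the PRBINV_CTR contribution to TID_BUSY; the other conditions that hold TID_BUSY true are left out. The class and method names are hypothetical, and the initial counts 6 and 5 are taken directly from the text.

```python
PCI_DMA_INITIAL_COUNT = 6     # PRBINV sent on behalf of a PCI or DMA TID
PROCESSOR_INITIAL_COUNT = 5   # PRBINV sent on behalf of a processor TID

class PrbInvTracker:
    """Toy model of PRBINV_CTR[] and its contribution to TID_BUSY[]."""

    def __init__(self):
        self.ctr = {}   # TID name -> INVDONE responses still expected

    def prbinv_sent(self, tid, requester_is_processor):
        """The COH broadcasts a PRBINV while servicing a request for this TID."""
        self.ctr[tid] = (PROCESSOR_INITIAL_COUNT if requester_is_processor
                         else PCI_DMA_INITIAL_COUNT)

    def invdone_arrives(self, tid):
        """An INVDONE carrying this TID arrives at the COH."""
        if self.ctr.get(tid, 0) > 0:
            self.ctr[tid] -= 1

    def tid_busy(self, tid):
        """True while any INVDONE for this TID is still missing."""
        return self.ctr.get(tid, 0) > 0


tracker = PrbInvTracker()
tracker.prbinv_sent("PCIWT0", requester_is_processor=False)
for _ in range(5):
    tracker.invdone_arrives("PCIWT0")
print(tracker.tid_busy("PCIWT0"))   # one response still missing
tracker.invdone_arrives("PCIWT0")
print(tracker.tid_busy("PCIWT0"))
```

Because the PCI widget sees its TID as busy until the last INVDONE arrives, it cannot declare the BWT complete (and so cannot raise the interrupt) while a PRBINV is still sitting in some processor's probe queue.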

This requires one more adjustment on the part of the PCI (and the DMA engine if it ever overwrites a code 
page). Interrupts to a processor to signal the completion of a page transfer must not be sent until the last WRITE 
for that transfer has completed. Further, we may consider whether we want to hold off all PCI writes until the 
completion (that is, release of the TID) of a BWT that received a PRBINV reply. This ensures that all updates to 
memory appear in order. 


7.12.4 PRBXXX A Just Prior to Evict Attempt on A 


A probe request arriving at a processor segment may not be processed by the CacCtl unit for several cycles.
Thus a probe arriving before an eviction attempt may not be processed until well after the eviction attempt. In
that case, we follow the paths described above. If the probe is processed before the eviction attempt, the receiving
segment will send a BWTGO or PRBWIN response before allowing the processor to initiate the L2 access that
would have caused the victimization.

In the case of a PRBBWT arriving just prior to what would have been an eviction attempt, the CAC will hold 
off all processor accesses to the L2 until the BWT operation has completed. 

In the case of a PRBWIN arriving just prior, the CAC will hold off all processor access to the L2 until after the 
tag array has been updated and the block made invalid. 

In the case of a PRBSHR or PRBBRD, the CAC will hold off processor access to the L2 until after data has 
been read from the L2 and sent to the requesting device. 


7.12.5 Implications for Stimulus Generators and Checkers

The sections above describe what the CAC and COH will do in response to a number of “ships passing in the
night” sequences. The responses and rules have some implications for stimulus generators and BFMs.


7.12.5.1 NOHIT sequencing against writeback data

A processor segment (PS) will never emit the following sequence: 

• RDV A (victim B)

• PRBNOHIT (for address B)

• WRITE DATA (for address B)


This is impossible since the eviction of B will cause the processor to defer responding with a PRBNOHIT until 
after the victim data has been sent out onto the CSW. 
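
A checker for this rule could look something like the Python sketch below. The event encoding is hypothetical: each event is a tuple, and the PRBNOHIT event carries the address the original probe targeted. A real checker would have to recover that address from the probe's TID, since Section 7.12.2 notes that PRBNOHIT itself is sent with an address of 0.

```python
def first_nohit_violation(events):
    """Return the index of the first PRBNOHIT that precedes its victim write data.

    events is a list of tuples observed leaving one processor segment:
      ("RDV", addr, victim_addr) or ("RDSV", addr, victim_addr)
      ("WRITE_DATA", addr)
      ("PRBNOHIT", probed_addr)
    Returns None if the stream obeys the rule.
    """
    pending_victims = set()   # victim addresses whose write data has not been seen
    for index, event in enumerate(events):
        kind = event[0]
        if kind in ("RDV", "RDSV"):
            pending_victims.add(event[2])
        elif kind == "WRITE_DATA":
            pending_victims.discard(event[1])
        elif kind == "PRBNOHIT" and event[1] in pending_victims:
            return index
    return None


A, B = 0x38E8D4FC0, 0x38E8D5000   # arbitrary example addresses
illegal = [("RDV", A, B), ("PRBNOHIT", B), ("WRITE_DATA", B)]
legal = [("RDV", A, B), ("WRITE_DATA", B), ("PRBNOHIT", B)]
print(first_nohit_violation(illegal))
print(first_nohit_violation(legal))
```

A stimulus generator can apply the same bookkeeping in reverse: while a victim address is pending, it simply withholds any PRBNOHIT for that address.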


7.13 Command Fields 


Certain of the CSW commands require an address or bytemask or some other value to be meaningful. Other 
commands stand on their own. Table 7.66 shows the required fields for each of the CSW command types. Where 
a field is not required by a command, it should be driven as 0 and ignored at the receiver. 

Table 7.66: CSW Commands, Required Fields


7.14 Transaction IDs (TIDs) and TID Busy Signals 


Among the CSW signals described in Tables 7.1 and 7.2 are the TIDBusy signals. These are used to indicate 
to a CSW client that the corresponding TIDs are in flight within either the even or odd coherence controller. 
A TID is “in flight” in a coherence widget if 


1. The TID corresponds to a valid entry in the ORC, or 
2. the TID corresponds to a valid entry in the WBC, or 


3. the TID was attached to a read operation sent to the DDR that has not yet either returned data or been shot 
down, or 


4. the TID was attached to a write operation sent to the DDR that has not yet “completed” in the eyes of the 
DDR controller. 

Each coherence widget originates its own version of the TID busy wires. At each bus stop, the TIDBusy output is
the result of ORing the TID busy bits from the EVEN COH and from the ODD COH. The COHE TID busy wires
are cohe_csw_TIDBusy_c4a[27:0] and the corresponding COHO wires are coho_csw_TIDBusy_c4a[27:0].

The TIDBusy bits from each of the coherence widgets are ORed together and distributed by the CSW after 
being flopped. For a processor/L2 segment PSO, the CSW output is 


csw_ps0_TIDBusy_c5a[0] = coho_csw_TIDBusy_c4a[PS0T0] || cohe_csw_TIDBusy_c4a[PS0T0];
csw_ps0_TIDBusy_c5a[1] = coho_csw_TIDBusy_c4a[PS0T1] || cohe_csw_TIDBusy_c4a[PS0T1];


For the DMA and PCI widgets, the CSW outputs are 


csw_dma_RdTIDBusy_c5a[3:0] = coho_csw_TIDBusy_c4a[DMARD3:DMARD0] | cohe_csw_TIDBusy_c4a[DMARD3:DMARD0];
csw_dma_WtTIDBusy_c5a[3:0] = coho_csw_TIDBusy_c4a[DMAWT3:DMAWT0] | cohe_csw_TIDBusy_c4a[DMAWT3:DMAWT0];


Internal to the COH widgets, TIDs are tracked for both WRITE and READ operations. That is, the busy state of a
TID that involves both a read and a write is the logical OR of the RD TID Busy state machine output, the ORC valid
bit for this TID, and the WBC valid bit for this TID. The Read TID Busy state machine is described in Figure 7.15.


[State diagram. Transition labels: RDEX, RDS, RDV, RDSV, BRD; read data returned from DDR; shootdown command
sent to DDR controller; shootdown complete from DDR controller.]

Figure 7.15: Read TID Busy State Machine


The tracking mechanism depends on several signals from the DDR controller:

ddr_coh_DataTID_c2a<4:0> The TID for a read data operation that is about to complete.

ddr_coh_DataValid_c2a If true, the corresponding TID has successfully completed a read of DDR memory. Perform
an ORC_Release on the outstanding transaction and cycle the RDTID Busy state machine back to the
free state.

ddr_coh_RdShotDown_c2a If true, the corresponding TID’s read operation was shot down. Cycle the RDTID
Busy state machine back to the free state.

ddr_coh_WtTID_c5a<4:0> The TID of a write operation that has passed the ordering point in the DDR
controller. That is, when the WtTIDVal bit is set, we know that the write for this TID has been sent to the
DDR DIMMs.

ddr_coh_WtTIDVal_c5a When true, the corresponding TID should cause a WBC_REL operation.
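
The Read TID Busy state machine and the busy OR can be sketched in Python as follows. This is one reading of Figure 7.15, not a transcription of the RTL: the state names, the event names, and the assumption that a shootdown passes through an intermediate pending state are all hypothetical. Events that the figure does not show for a given state are simply ignored here.

```python
from enum import Enum

class RdTidState(Enum):
    FREE = 0
    BUSY = 1                # read sent to the DDR, data not yet returned
    SHOOTDOWN_PENDING = 2   # shootdown sent, completion not yet reported

READ_COMMANDS = frozenset({"RDEX", "RDS", "RDV", "RDSV", "BRD"})

def next_state(state, event):
    """Advance the per-TID read busy state for one event."""
    if state is RdTidState.FREE and event in READ_COMMANDS:
        return RdTidState.BUSY
    if state is RdTidState.BUSY and event == "DATA_VALID":        # ddr_coh_DataValid
        return RdTidState.FREE
    if state is RdTidState.BUSY and event == "SHOOTDOWN_SENT":
        return RdTidState.SHOOTDOWN_PENDING
    if state is RdTidState.SHOOTDOWN_PENDING and event == "RD_SHOT_DOWN":  # ddr_coh_RdShotDown
        return RdTidState.FREE
    return state

def tid_busy(rd_state, orc_valid, wbc_valid):
    """Busy is the OR of the read state machine, the ORC valid bit, and the WBC valid bit."""
    return rd_state is not RdTidState.FREE or orc_valid or wbc_valid


state = RdTidState.FREE
for event in ("RDSV", "SHOOTDOWN_SENT", "RD_SHOT_DOWN"):
    state = next_state(state, event)
    print(event, state.name, tid_busy(state, orc_valid=False, wbc_valid=False))
```

Each coherence widget would hold one such state per TID, and the CSW then ORs the even and odd results as described in the assignments above.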


7.14.1 TID Allocation — the IO and MEM TID Spaces 


To avoid a nasty and obscure deadlock situation, the processor segment must allow a cache read/replacement
operation to proceed in parallel with an IO write or, potentially, an IO read operation. This could require that
TID1 for a processor segment (normally used for RDV, RDSV, and IOWT operations) be used by both an IO
write and an RDV/RDSV at the same time. We allow this by treating TIDs belonging to processors as existing
in two different spaces. PSnT0 and PSnT1 (where n is in the range 0 to 5) can represent transactions in IO space
or memory space. If the accompanying command is an IOWT, IORD, SPCL, INT, or DONE, the TID should be
treated as an IO space TID. Otherwise it is for a memory space transaction. If the accompanying data is a double
word, then the TID should be treated as an IO space TID. TID busy only reports the condition of TIDs in memory
space. IO operations may be emitted from a processor segment if the required TID is not otherwise occupied in IO
space.
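
The two-space rule can be sketched in Python as below. The classification set comes from the paragraph above; the occupancy tracker, its method names, and the choice to classify by command only (ignoring the double-word data case) are hypothetical simplifications.

```python
IO_SPACE_COMMANDS = frozenset({"IOWT", "IORD", "SPCL", "INT", "DONE"})

def tid_space(command):
    """Classify the TID accompanying a command as IO space or memory space."""
    return "IO" if command in IO_SPACE_COMMANDS else "MEM"

class TidOccupancy:
    """Toy tracker: one TID name may be occupied once per space."""

    def __init__(self):
        self.occupied = set()   # (TID name, space) pairs currently in use

    def can_issue(self, tid, command):
        return (tid, tid_space(command)) not in self.occupied

    def issue(self, tid, command):
        if not self.can_issue(tid, command):
            raise ValueError(f"{tid} is already occupied in {tid_space(command)} space")
        self.occupied.add((tid, tid_space(command)))

    def retire(self, tid, command):
        self.occupied.discard((tid, tid_space(command)))


tids = TidOccupancy()
tids.issue("PS0T1", "RDV")                 # memory space use of TID1
print(tids.can_issue("PS0T1", "IOWT"))     # IO space is still free for the same TID
tids.issue("PS0T1", "IOWT")
print(tids.can_issue("PS0T1", "IORD"))     # IO space now occupied
```

Keeping the two spaces independent is what lets the replacement read and the IO write overlap on TID1 instead of waiting on each other.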


7.15 The Parts 


7.15.1 The Coherence Controller (COH) 
7.15.1.1 Block Diagram 


The Coherence Controllers (instance names COHE for the “Even” side coherence widget and COHO for the “Odd”
side coherence widget) field data transfer requests from the six processors, the PCI controller, and the DMA
engine. In addition, each coherence controller services I/O requests for the configuration registers in its associated
DDR controller.

Each coherence controller contains 


• Six 2K by 44 bit TAG arrays (parity protected)

• One 14 entry Outstanding Read CAM that can be indexed by virtual address bits 35:7, or by a six bit
transaction ID. Its payload is the Transaction ID and low address bits of the dependent operation, and a
Valid bit.

• One 14 entry WriteBack CAM that can be indexed by VA<35:7> or by the TID. Its payload is the TID and
low address bits of the dependent operation, and a valid bit.


The CAMs, being implemented in flip-flops, rather than RAM cells, need not be ECC protected. The SER (soft 
error rate) for the Tag RAMs is such that we’d see a TAG error about once every 30 years. On the other hand, a 
bit error in the Tags could cause us to generate a “wrong” result or launch the missile, so we'll parity protect the 
RAMs and force a system recovery if an error is detected. 


Figure 7.16: Coherence Controller Block Diagram


7.15.1.2 Processing Pipeline(s)


Commands are processed in a two or four stage pipeline that begins in C3 (to align with the pipelines from the 
processor segments, the PCI and the DMA engine.) 


C3: In C3 we look the address up in each of the Master Tags arrays, the ORC, and the WriteBack CAM. 


C4: In C4 we update the Tag arrays, the ORC, and the WriteBack CAM. We also send out any transaction 
operations on the outbound command ports. 


Data returns from the DDR controller in C10. (This is an arbitrary choice, but it seems to fit well with the rest
of the pipeline definitions in the DMA engine, etc.) Data is written into the DDR controller in C3 and following
cycles. Figure 7.17 shows the four major processing pipelines in the coherence engine.


Figure 7.17: Coherence Engine Processing Pipelines

Tables 7.67, 7.68, and 7.69 all assume an incoming command from processor or unit Px. If the target block is
owned, Py is the owner.

Normally, an operation can’t hit in both the WBC and the ORC in the same cycle. PRBDONE and WRSTRANS 
are the only exceptions. Ignore the WBC hit in these cases. 

Note that WRSTRANS hitting on an EXCLUSIVE block means that we saw a sequence like (RDS,Px,A) 
(RD,Pz,A) (WRSTRANS,Py,A) where processor Z flipped the L2 cache states from SH to INV in Px before the 
transaction completed. This is OK. Everything will eventually complete in order and Px will have seen the block 
for a short time in the SH state before answering a PRBINV broadcast. (This is the reason we write EXCLUSIVE 
blocks back to DDR rather than sending the data directly from Py to Px.) 


7.15.1.3 Recovering from Tag ECC Errors 


As it turns out, the master tag arrays contain about 500K bits of storage. We’re likely to see a soft error rate 
on the SRAM cells of about 3000 failures per billion hours per Mbit. So, assuming 1000 chips in a system: 


\[
\frac{10^{9}\ \text{hours} \times 1\ \text{Mbit}}{1000\ \text{nodes} \times 3000\ \text{failures} \times 0.5\ \text{Mbit}} \approx 667\ \text{hours} \approx 28\ \text{days}
\]


That means that we’d see a tag parity failure about once per month. We can’t really recover from that kind
of error so we’d have to crash the node and probably the rest of the cluster. That’s one of the problems with
welding the fabric so close to the processors: if a processor sneezes, the fabric catches pneumonia. So, we need to
inoculate the processors by building ECC into the tag RAMs. (Note that we don’t need to do this for the CAMs
since they’re implemented in much more robust flip-flops.)

The tag RAMs cycle at 4 ns, so we have more than enough time to do ECC scrubbing and correction. Tag entries
are written in the second stage of the command processing pipeline, so we have enough time to calculate the ECC
before the tag update cycle.


7.15.2 The L2 Switch (CSW) 
7.15.2.1 Bus Stops, Node Numbers, and Transaction Targets 


7.16 Arbitration at the PS to CSW Port 


Commands issued by the CAC (RDS, RDSV, RDEX, RDV), the processor (IOWT, IORD, SPCL, INT), or in
response to probe operations (WRSTRANS, BWTGO, PRBNOHIT) all must contend at the output of the CAC/PS
for the outbound command request wires. Arbitration between these request streams is more complicated than
one would hope, but simulation and detailed analysis suggest that the scheme is not prone to deadlock. (Neither
simulation nor logical argument can ever guarantee freedom from deadlock, but we do the best we can.) This
section describes the arbitration rules and makes the argument that no combination or sequence of requests can
cause any one request to remain starved for access to the CSW.

The arbitration is a hybrid priority based and round robin scheme. First, any requests that must be retried
from a previous cycle are guaranteed access to the outbound command wires. Second, if an L2 cache miss access
(RDS, RDSV, RDEX, RDV) was not driven on to the command bus in the previous tic, the waiting request wins.
Third, if a NOHIT response causes the CAC to issue a RDSR or RDEXR (read retry), the retry request is driven
onto the bus. Fourth, if there are no waiting requests, but the L2 tag lookup in the previous cycle has resulted in an
L2 miss, the requested command (RDSV, RDV, RDS, RDEX) is driven onto the command wires. If none of these
conditions obtains, then we move on to the group of requests that arbitrate in “round robin” fashion. (Note that the
priority based portion of the arbitration is deadlock free as the CAC only supports one outstanding memory read
access at a time. Thus an RDSR will never contend with an outgoing RDEX, and an L2 miss will never contend
with a previously queued memory request.)

Ten different sources of outbound command requests contend in the second stage of arbitration: 


LOCINVDONE: Sends out an INVDONE in response to a PRBINV that arrives at this CAC due to a read 
operation that was initiated by this CAC. 


WRSTRANS: Sends out a Write Shared Transition command in response to a PRBSHR on a block held in the 
exclusive state. 

[Incoming Command | ORC Miss, WBC Miss | ORC Hit, WBC Miss | WBC Hit, ORC Miss]
RDEX | Launch read to DDR. Update L2 Tags for PX. Add PX request to ORC. | Kill read to DDR. Update L2 Tags for PX. Add PX request to ORC. Add PX dependence on PY transaction in ORC. | Kill read to DDR. Update L2 Tags for PX. Add PX request to ORC. Add PX dependence on PY transaction in WBC.
[RDS] | Launch read to DDR. Update L2 Tags for PX. Add PX request to ORC. (RDSV: Add victim address to WT Q.) | Kill read to DDR. Update L2 Tags for PX. Add PX request to ORC. Add PX dependence on PY transaction in ORC. | Kill read to DDR. Update L2 Tags for PX. Add PX request to ORC. Add PX dependence on PY transaction in WBC.
[BRD] | Launch read to DDR. Add PX request to ORC. | Kill read to DDR. Add PX request to ORC. Add PX dependence on PY transaction in ORC. | Kill read to DDR. Add PX request to ORC. Add PX dependence on PY transaction in WBC.
[BWT] | Send BWTGO to requester. Add PX request to WBC. | Queue transaction dependence on PY in ORC. Add PX request to WBC. | Queue transaction dependence on PY in WBC. Add PX request to WBC.
[RDSR] | Retry event reacting to “NOPROBE” response: Launch read to DDR.
[RDEXR] | Retry event reacting to “NOPROBE” response: Launch read to DDR.
[WINV] | Error! Writeback from non-owning processor! Complete write, update L2 Tags, Declare Machine Check Exception. | WINV from PX passed an inflight PRBWIN for this block. Add Addr to WT Queue and WBC. | WINV from PX passed an inflight BWT for this block. Kill transaction in Write Queue, as the BWT takes precedence.
[FLUSH] | Error! Flush from non-owning processor! Update L2 Tags (invalidate). Declare Machine Check Exception.
RDIO | RDIO Transactions Never Arrive at COH
[IDLE] | Cancelled operation: do nothing.

Incoming Command | ORC Miss, WBC Miss | ORC Hit, WBC Miss | WBC Hit, ORC Miss

RDEX

Send read to DDR. Update L2 Tags for PX to EX. Invalidate L2 Tags for ALL other matchers. Add PX request to ORC. Broadcast PRBINV to all nodes.

Kill read to DDR. Update L2 Tags for PX to SH. Add PX request to ORC. Send PRBSHR command to “first matcher” PY.

Kill read to DDR. Update L2 Tags for PX to EX. Invalidate L2 Tags for all other matchers. Add PX request to ORC. Add PX dependence on PY transaction in ORC.

Kill read to DDR. Update L2 Tags for PX to SH. Add PX request to ORC. Add PX dependence on PY transaction in ORC.

Not Possible. (If a write is outstanding against the block, why is it in SHARED state?)

Add Victim Address to WT Queue and WBC. Otherwise, identical to RDEX.

Not Possible.

Add Victim Address to WT Queue and WBC. Otherwise, identical to RDS.

Kill read to DDR. Send PRBBRD to PY. Add PX request to ORC.

Send BWTGO to PX. (Note the COH will send PRBINV after BWTDONE arrives.) Add PX address to write queue. Add PX request to WBC. Note need for PRBINV.

Kill read to DDR. Add PX request to ORC. Add PX dependence on PY transaction in ORC.

Queue transaction dependence on PY in ORC. Add PX request to WBC.

Retry event reacting to “NOPROBE” response: Launch read to DDR.

Retry event reacting to “NOPROBE” response: Launch read to DDR.

Error! WINV should only arrive for exclusively owned blocks unless we have a ships-passing-in-the-night problem (ORC or WBC hit).

Add victim to WT queue and WBC. We’ll wait for the RDSR/RDEXR.

Not Possible.

Not Possible.

Queue dependency on PY in WBC. (We
Queue dependency on PY in WBC. (We

Collision with a BWT. Kill this write when it arrives. Create no WT queue or WBC entry.

FLUSH (UNUSED) | Invalidate L2 Tags for PX.
RDIO | RDIO Transactions Never Arrive at COH
WTIO | WTIO Transactions Never Arrive at COH
PRBDONE | Error! Should hit on ORC entry.
WRSTRANS | Error! WRSTRANS should hit on the ORC entry for the transaction that caused it.
BWTDONE | Error! Should hit on WBC entry.

Activate matching ORC Entry.

Find FIRST ORC entry for this address (ORC_CheckS). Add Addr to WT Queue and WBC. See Table 7.22 steps L+2 and following. (Note there may be a spurious WBC hit for this operation. Ignore it.)

Ignore

Activate matching WBC entry. Broadcast PRBINV to all.

IDLE | Cancelled operation: do nothing.

[Incoming Command | ORC Miss, WBC Miss | ORC Hit, WBC Miss | WBC Hit, ORC Miss]
RDEX | Kill read to DDR. Update L2 Tags for PX. Invalidate L2 Tags for PY. Add PX request to ORC. Send PRBWIN command to PY. | Kill read to DDR. Update L2 Tags for PX. Invalidate L2 Tags for PY. Add PX request to ORC. Add PX dependence on PY transaction in ORC. | Kill read to DDR. Update L2 Tags for PX. Invalidate L2 Tags for PY. Add PX request to ORC. Add PX dependence on PY transaction in WBC.
[RDS] | Kill read to DDR. Update L2 Tags for PX. Update L2 Tags for PY to SH. Add PX request to ORC. Send PRBSHR command to PY. | Kill read to DDR. Update L2 Tags for PX. Update L2 Tags for PY to SH. Add PX request to ORC. Add PX dependence on PY transaction in ORC. | Kill read to DDR. Update L2 Tags for PX. Update L2 Tags for PY to SH. Add PX request to ORC. Add PX dependence on PY transaction in WBC.
[BRD] | Kill read to DDR. Send PRBBRD to PY. Add PX request to ORC. | Kill read to DDR. Add PX request to ORC. Add PX dependence on PY transaction in ORC. | Kill read to DDR. Add PX request to ORC. Add PX dependence on PY transaction in WBC.
[BWT] | Send PRBBWT to PY. Add PX request to WBC. | Queue transaction dependence on PY in ORC. Add PX request to WBC. | Queue transaction dependence on PY in WBC. Add PX request to WBC.
[RDSR] | Retry event reacting to “NOPROBE” response: Launch read to DDR.
RDEXR | Retry event reacting to “NOPROBE” response: Launch read to DDR.
WINV | Add Addr to WT Queue and WBC. Invalidate L2 Tags for PX. | WINV from PX passed an inflight PRBWIN for this block. Invalidate L2 Tags for PX. Add Addr to WT Queue and WBC. | WINV from PX passed an inflight BWT for this block. Invalidate L2 Tags for PX. Kill transaction in Write Queue, as the BWT takes precedence.
[PRBDONE] | Error! Should hit on ORC entry.
WRSTRANS | Error! WRSTRANS should hit on the ORC entry for the transaction that caused it. | Find FIRST ORC entry for this address (ORC_CheckS). Add Addr to WT Queue and WBC. See Table 7.22 steps L+2 and following. (Note there may be a spurious WBC hit for this operation. Ignore it.)
[BWTDONE] | Error! Should hit on WBC entry.


Figure 7.18: Even Bound Command/Address Arbitration Chain


STAGES: These commands (PRBNOHIT, BWTNOHIT, BWTGO, INVDONE) are in response to probe commands
arriving from other nodes.


DONE: These commands (BWTDONE, PRBDONE) are issued in response to completion of a BWT or probe 
transaction. 


INTW: This stream sends out an INT command to deliver an interrupt to another processor segment. 
SPCL: This stream sends out a SPCL command to the DMA. 
IDONE: This stream sends out a DONE command to signal completion of an INT delivery. 


VICCAN: This stream sends out a WBCANCEL command to rescind a writeback request for a block that is now 
known to be clean. 


IORD: This stream sends out RDIO commands. 
IOWT: Surprise! This stream sends out WTIO commands. 


The arbitration passes through two stages. In stage 1, the ten sources each determine their eligibility to bid. For 
instance, the IOWT source may not bid for access to the command bus if a previous IO write is in flight, or if the 
IORD stream has an earlier IORD waiting, or if the most recent IO write sent out its data in the last 8 tics or 
so. In stage 2, all the eligible bidders compete. The highest priority bidder rotates round-robin. The round-robin 
pointer is bumped each time some stream wins the arbitration. (It is not bumped if there were no requesters, and 
it is not bumped if a memory read command is being driven because of the priority based arbitration described 
above.) 
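
The second (round-robin) stage can be sketched in Python as follows. This is a simplified model under stated assumptions: the position of each stream in the rotation is hypothetical (it follows the order in which the streams are listed in this section), "bumped" is interpreted as advancing the pointer by one position, and the stage 1 eligibility rules are reduced to a set of eligible stream names supplied by the caller.

```python
STREAMS = ["LOCINVDONE", "WRSTRANS", "STAGES", "DONE", "INTW",
           "SPCL", "IDONE", "VICCAN", "IORD", "IOWT"]

class RoundRobinArbiter:
    """Toy model of the stage 2 arbitration among the ten request streams."""

    def __init__(self, streams):
        self.streams = list(streams)
        self.pointer = 0   # index of the stream that currently has highest priority

    def arbitrate(self, eligible, memory_read_driven=False):
        """Return the winning stream name, or None if nothing is granted.

        eligible: set of stream names that passed the stage 1 eligibility checks.
        memory_read_driven: True when the priority stage is driving a memory read
        command this tic; the round-robin pointer is then left alone.
        """
        if memory_read_driven:
            return None
        count = len(self.streams)
        for offset in range(count):
            index = (self.pointer + offset) % count
            if self.streams[index] in eligible:
                self.pointer = (self.pointer + 1) % count   # bump on every win
                return self.streams[index]
        return None   # no requesters: pointer is not bumped


arbiter = RoundRobinArbiter(STREAMS)
for _ in range(3):
    print(arbiter.arbitrate({"WRSTRANS", "IORD"}))
```

With this interpretation a stream that requires no resource is always eligible, so it waits at most until the pointer reaches its own position. Streams that share a resource can still lose repeatedly in stage 1, which is the starvation question the following paragraphs address.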

So how could we create a starvation case? Assume some request stream A needs resource X to be eligible. Now
assume a second stream B also needs X. If A and B both arb at the same time and B wins, A will lose. Now A
can’t bid again until B releases its resource. If B releases its resource and then needs to rearb again, it may beat A
again. In fact, if B releases its resource and any other stream requires resource X, A could lose again. Round-robin
arbitration will not prevent this kind of starvation. So, we need to make sure that if A loses a round of arbitration,
it will eventually become the only eligible requester that requires its resource. How do we do that?

First, we can dismiss all command streams that require no resource at all to become eligible. This eliminates
the LOCINVDONE, DONE, STAGES, IDONE and VICCAN streams.⁴ These require no resources, so even
in the worst case they only wait for the round-robin pointer to make them the highest priority choice.

Second, we should note that the remaining memory related stream WRSTRANS only requires a free TID. In
this case, the requirement is that either TID0 or TID1 be available for a memory transaction. (If the TID is being
used by a RDIO, WTIO, INT, or SPCL, it is still available for use by a memory operation.) Since only one memory
transaction can be in flight at a time, and we never need to do two WRSTRANS operations at a time, there is no
“B” request stream that could starve out the WRSTRANS stream if it was “A” in the above example. Note that
nothing that happens with IO related operations ever contends with memory operations. So now, having dismissed
arbitration conflicts among memory operations, we only need consider starvation among the IO operations.

First, because of the strict ordering of IORD and IOWT operations from the core, we never allow an IOWT 
to pass an IORD or vice-versa. This means that IORD operations never contend with IOWT. So they can’t 
starve each other out. 

Alas, there has to be a fly in the ointment somewhere. The IOWT stream requires the IOWriteTID to be
available. So do the INTW and SPCL streams. This is the lone known opportunity for starvation in the CMX.
A pathological program could issue an IOWT request and follow it with a sequence of writes to the interrupt delivery
or SPCL delivery register so as to prevent the IOWT from ever completing. However, we require a SYNC either
before or after any SPCL or INTW write in order to ensure proper delivery of the SPCL operation. This would
prevent the IOWT from starving. In any case, the IOWT was likely not performed from user mode, as we aren’t
likely to allow user mode programs to fiddle with IO space, even if we do allow (and encourage) user mode access to
the SPCL registers. Programs that send out back-to-back SPCLs without SYNCs get what they deserve. (A SYNC
instruction would stall the processor and prevent further SPCL writes until the stalled IOWT had completed.)


7.17 Definitions and Enumerations 


7.17.1 Package Attributes 
Package 


chip_cac_spec 


7.17.2 Definitions 
Defines 
CAC 


CMD_ADDR_FIFO_DEPTH | Depth of Command/Addr FIFOs for all bus stops


PCI DATA fifo depth must be 5 to cover the fact that the PCI widget could have three BRDs, one RDEX, and two 
WTIO operations completing at one time and it takes 4 cycles to consume a FIFO entry. The DMA widget only 
needs 3 as it can remove an entire entry on every tick, and need only support four BRDs and one WTIO completion.
If all five transactions arrived sequentially at a DMA port from the same direction, we’d peel them off in order and
only need one slot in the queue to accommodate them. The worst case for DMA is three from the Even side and two
from Odd. 


Footnote 4: A reader of the CacCmxBeh sources will note that STAGES requests all require that there be no queued victim writebacks. This
condition is redundant, as we already ensure that no writebacks are in progress before issuing the requests from the probe control 
state machine. A similar condition on VICCAN requests is only a delaying mechanism, as we only allow one read miss in flight at a 
time. It looks redundant now, but we aren’t going to remove this logic, as all reasoning is subject to verification, and we’ve got lots of 
verification cycles on this logic. 
7.17.3 Processor to L2 Cache Commands


This section has been removed. 


7.17.4 L2 Cache to Processor Commands 


This section has been removed. 


7.17.5 L2 Cache to/from Coherence Controller Commands 


Enum 
CohCmd 
(170 Device Use) 


Read shared (instruction)

RDSV | Read shared, write victim
5’b00100 | RDSR | Read shared, retry
RDV | Read data exclusive, write victim
RDEXR | Read data exclusive, retry


WRSTRANS | Write retaining shared copy
WINV | Writeback and Invalidate
DONE | WINV, INT, or SPCL is complete
WBCANCEL | Cancel writeback request on RDSV and RDV


5’b10000 | BRD | Block Read
5’b10010 | BWTNOAIT | Block Write encountered evicted blocks
5’b10100 | BWTGO | Continue Block Write
5’b10101 | BWTDONE | Block Write Complete
BRDR | Block Read Retry


PRBINV | Probe to invalidate
PRBWIN | Probe to writeback/transfer
PREBRD | Probe to forward Block Read
Output 

: 


5DOTT00 
5’b01101 SPCL Special Command 


7.17.6 L2 Cache Coherence Widget States 


Enum 


CohState 
2’b00
2’b01 | EXCL
2’b10 | SHARE
2’b11 | UNUSED | Unused encoding
7.17.7 L2 Segment Cache States


Enum 


CacState 


3’b000
3’b001 | EXCL | Exclusive
3’b010 | SHARE
3’b110 | DIRTY | Different from Memory Copy
3’b111 | UPDATED | Different from Memory and Updated since last fill.


7.17.8 L2 Cache Modified States 


Enum 
CohModState 
PbO 


7.17.9 L2 Half Block Update Tags 


Enum 
CohHalfMask 
rbI 


7.17.10 L2 Cache Interface Numbers (Bus Stop Numbers) 


This enumeration contains the physical bus stop number, used to route on the cache switch. For software
interrupts and addressing, the similar AddrStopNum (section 16.6.5) is used instead. (Thus, this table may change without
affecting any software.)

Enum 

CswStopNum 


7.17.11 L2 Cache Interface Numbers (Bus Stop Numbers) for TWICE9 


This enumeration contains the physical bus stop numbers (for TWICE9), used to route on the cache switch.
This new set of enumerations has been created because using the old enumeration would mean that the constant
for COHE could not be changed (as this would break ICE9A), but it also can’t be redefined (since enumerations
don’t support this). If TWICE9 required keeping the old value for COHE, this would require significant coding
changes to Verilog and SystemC code in Cac, Dma, Coh, and PMI.


Enum 


CswStopNumTwe 


(Product) 


COHO | TWC9A+ | coherence controller on odd side
CORE1 | TWC9A+ | L2 segment for core 1
CORE0 | TWC9A+ | L2 segment for core 0
DMA | TWC9A+
CORE2 | TWC9A+ | L2 segment for core 2
CORE3 | TWC9A+ | L2 segment for core 3
TWC9A+ | PCI controller
CORE5 | TWC9A+ | L2 segment for core 5
CORE7 | TWC9A+ | L2 segment for core 7
4’d10 | CORE6 | TWC9A+ | L2 segment for core 6
4’d11 | CORE9 | TWC9A+ | L2 segment for core 9
CORE4 | TWC9A+ | L2 segment for core 4
4’d12 | CORE8 | TWC9A+ | L2 segment for core 8
4’d13 | COHE | TWC9A+ | coherence controller on even side
4’d14 | | TWC9A+
4’d15 | BROADCAST | TWC9A+ | Broadcast to all nodes (legal from COHE or COHO only)


7.17.12 Transaction IDs 


Enum 


CswTid 
PS0T0 | Any op for PS0


Any op for PS0
Fa 
Fa 
5dr 
Fas 
Fay 
FaI0 
Fall 


5’d12 | BRD 0 for DMA
5’d13 | BWT 0 for DMA
5’d14 | BRD 1 for DMA
5’d15 | DMAWT1 | BWT 1 for DMA
5’d16 | BRD 2 for DMA
5’d17 | BWT 2 for DMA
5’d18 | BRD 3 for DMA


FaI9 
Fa20 
Fal 
Faz 
FaBB 
Fda 
FAD 
Faz 
5’d27 | BWT 3 for PCI


INT | Used for all INT commands from all blocks


7.17.13 Transaction IDs for TWICE9 


Enum 
CswTidTwe 


(Product) 
PS0T0 | TWC9A+ | Any op for PS0
6’d18
6’d19
6’d20
6’d21
6’d22
6’d23
6’d24
6’d25
6’d26
6’d27
6’d28
6’d29
6’d30
6’d31
6’d32
6’d33
6’d34
6’d35
6’d36
6’d37
6’d38
6’d39
6’d40
6’d41
6’d42
6’d43
6’d44
6’d45
6’d46
6’d47
6’d48
6’d49
6’d50
6’d51
6’d52
6’d53
6’d54
6’d55
6’d56
6’d57
6’d58
6’d59
6’d60
6’d61
6’d62
6’d63


PS4T2
PS4T3
PS5T0
PS5T1
PS5T2
PS5T3
PS6T0
PS6T1
PS6T2
PS6T3
PS7T0
PS7T1
PS7T2
PS7T3
PS8T0
PS8T1
PS8T2
PS8T3
PS9T0
PS9T1
PS9T2
PS9T3
MARDO 
MAWTO 
MARDI 
MAWTI 
MARD2 
MAWT2 
MARD 
MAWTS 
TAR 
TAWT 
MARDS 

TAWT 
MARDS 
MAWT 
PCIRD0
PCIWT0
PCIRD1
PCIWT1
PCIRD2
PCIWT2
PCIRD3
PCIWT3
INT 


6’d41 | DMAWT0


7.17.14 Address Tag and Index Fields for L2 and Coh Tag and Data arrays 


Defines 
CADDR_FLD 
64’h040 | BANK_SEL_MSK | Which bit selects the “bank” (i.e. EVEN or ODD side COH)
16’d10 | HASH_WIDTH | How wide is hashed portion of the tag index?
16’d7 | HASHLO_START | Where does the low half of the tag hash field start?
16’d17 | HASHHI_START | Where does the hi half of the tag hash field start?
16’d18 | TAG_WIDTH | How wide is the stored address tag?


7.17.15 L2 Cache Useful Dimensions 


Defines 
CAC_DIM 
Definition 
16’d2048 | L2TAGARR_SIZE | Number of entries in L2 Tag Array


16’d8192 | L2DATWARR_SIZE | Number of Quadwords (16 bytes) in each WAY of the L2 Data Array 


7.17.16 Coherence Engine Useful Dimensions 


Defines 
COH_DIM 


Constant 


8'd27 MAX_TID 


7.17.17 Coherence Engine Useful Dimensions for Twice9A 


Defines 
COH_DIM_TWC 


(Product)


TWC9A+


ROQ-ENTRIES 


7.17.18 Coherence Engine L2 Tag Array Fields 


Defines 
COH_MTAG 
Constant
Low bit of Way 0 Tag
Low bit of Way 1 Tag
Width of a Tag Field


Low bit of Way 0 State

Low bit of Way 1 State

How wide is the state

Does the assoc proc OWN this block?
Does the assoc proc OWN this block?
ECC bits


7.17.19 SPCL Address Request Fields 
Defines 


SPCL_ADDR 
Constant Definition 
ADDR2_LOW | Low bit of ADDR2 field
ADDR2_WIDTH | ADDR2 Field Width
8’d16 | ADDR1_LOW | Low bit of ADDR1 field
ADDR1_WIDTH | ADDR1 Field Width
8’d20 | BSN_LOW | Destination Bus Stop Number low bit
BSN_WIDTH | Destination BSN Field Width


7.17.20 SPCL CSW Command Fields 
Defines 


SPCL_CMD 


Constant Definition 


ADDR2_LOW | Low bit of ADDR2 field
ADDR2_WIDTH | ADDR2 Field Width
8’d16 | ADDR1_LOW | Low bit of ADDR1 field
ADDR1_WIDTH | ADDR1 Field Width
Low byte of Data low bit
Low data width
Rest of DAT field
Width of upper data field


7.18 Registers 


7.18.1 Cache Probe Control Register 


The cache probe registers are used to generate an L2 intervention into the L1, by request of the local code. This
is implemented only in the verification model, for testing purposes. 


Register 


R_CacxProbeCtlMagic 


Attributes 


-noregtest -noregdump 
Address


0x00_0400 (plus base address) 


31 | Done | R | Intervention valid. Cleared on writing the Prb bit, set when the intervention has completed.
30 | | R | Intervention resulted in hit.
29 | | R | 0 | Intervention resulted in dirty.
| | R | Intervention resulted in locked return.
| Reserved
| IOHoldoff | R | Inhibit IO write acks until this probe has been acknowledged.
26:1 | Delay | | Probe delay. Wait this number of cycles after _Prb bit is
| | | When written one, create a probe as specified.

7.18.2 Cache Probe Address Register 

The cache probe registers are used to generate an L2 intervention into the L1, by request of the local code. This
is implemented only in the verification model, for testing purposes. 
Register 

R_CacxProbeAddrMagic 


Attributes 


-noregtest -noregdump 


Address 
0x00_0404 (plus base address) 


31:3 | AddrL | RW | Address Low. Address[31:3] to generate probe to. Verification implementation only.


2:0 | AddrH | RW | Address High. Address[34:33] to generate probe to. Verification implementation only. [35] is always zero.


7.18.3 Cache Probe Random Address Registers 


The cache probe registers are used to generate an L2 intervention into the L1, by request of the local code. This
is implemented only in the verification model, for testing purposes. 


Register 
R_CacxProbeRandAddrMagic[7:0]


Attributes 


-noregtest -noregdump 


Address 
0x00_0500-0x00_053F (plus base address) (Add 0x8 per entry) 
36 | Enable | RW | 0 | Send probes to the address contained in Addr.


35:5 | Addr | RW | Address. Address[35:0] to generate probe to. Verification
implementation only. In ICE9A, bits [4:0] are ignored
and treated as 0, since all probes are aligned to L1 cache
blocks. Starting in ICE9B, bits [5:0] are ignored and treated
as 0, since all probes are aligned to L2 cache blocks.
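
The alignment rule for the Addr field can be illustrated with a small sketch. This is a hypothetical Python model, not part of the verification environment: the function name align_probe_address and the revision strings used as dictionary keys are assumptions made for illustration. Only the ignored bit ranges come from the field description above.

```python
# Number of low address bits ignored per chip revision (from the Addr field text).
# The revision strings used as keys are an assumption of this sketch.
IGNORED_LOW_BITS = {"ICE9A": 5, "ICE9B": 6}

ADDRESS_MASK_36 = (1 << 36) - 1  # probe addresses are Address[35:0]


def align_probe_address(addr: int, chip_rev: str) -> int:
    """Return the probe address with the ignored low bits treated as 0."""
    ignored = IGNORED_LOW_BITS[chip_rev]
    low_mask = (1 << ignored) - 1
    return addr & ADDRESS_MASK_36 & ~low_mask


ice9a_addr = align_probe_address(0x1234567F, "ICE9A")  # aligned to L1 blocks
ice9b_addr = align_probe_address(0x1234567F, "ICE9B")  # aligned to L2 blocks
```

With the same input, the ICE9A result keeps bit 5 while the ICE9B result clears it, which is the only behavioral difference the text describes.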


7.18.4 Cache ECC Injection Register 


Controls BFM backdoor ECC injection to L1 I and D cache RAMs. This is implemented only in the verification 
model, for testing purposes. 


Register 
R_CacxInjEccMagic 


Attributes 


-noregtest -noregdump 


Address 
0x00_0408 (plus base address) 


FlipAllLinesSoon | RW | 0 | Flip one randomly selected bit in every cache block


0 | StartRandomFlips | RW | 0 | Start continuous random L1 parity/ECC single-bit flipping


7.18.5 I/O Addresses in L2 Segment 
Defines 
CAC_IO 


36’hE_9000_0000 | WTIOADDR | I/O writes are implemented as WTIO command, RDIO command, 
then data. When the RDIO is sent back to the initiator, the Addr 


must be set to CAC_IO_WTIOADDR. 


7.18.6 Interrupt Cause Register 
Register 
R_CacxIntCr[7:0]


Attributes 


-kernel 


Address 
0x00_0000-0x00_003F (plus base address) (Add 0x8 per entry) 


Reserved. Read as zero.


ACTIVE | If read as 1, corresponding interrupt is asserted. Write 1
to clear. Note when clearing Active, the _Overflow bit is
also cleared, see bug3343.


OVERFLOW | RW1C | 0 | Interrupt Cause Register Overflow.
7:0 | CAUSE | R | 0 | Interrupt Cause
When the Interrupt Cause register is over-written (that is, on the arrival of an ICR write or INT command
from the CSW for an ICR whose ACTIVE bit is set) the OVERFLOW bit will be set, and all other bits will be 
left unchanged. 


Writing 1 to ACTIVE will clear ACTIVE. Writing 1 to OVERFLOW will clear OVERFLOW. A write to either 
bit will leave CAUSE as it was. 
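
The ACTIVE/OVERFLOW behavior described above can be sketched as a small behavioral model. This is an illustration only, written in Python rather than taken from the design sources. The class and method names are hypothetical, the 8-bit CAUSE width is an assumption, and the rule that a delivery to an idle ICR latches CAUSE and sets ACTIVE is implied by the text rather than stated. The clearing of OVERFLOW together with ACTIVE follows the field description's note about bug3343.

```python
from dataclasses import dataclass


@dataclass
class InterruptCauseRegister:
    """Behavioral model of one ICR entry (ACTIVE, OVERFLOW, CAUSE)."""
    active: bool = False
    overflow: bool = False
    cause: int = 0

    def deliver(self, cause: int) -> None:
        """Model the arrival of an ICR write or INT command from the CSW."""
        if self.active:
            # Over-write case from the text: set OVERFLOW, leave other bits alone.
            self.overflow = True
            return
        # Assumption: a delivery to an idle ICR latches CAUSE and sets ACTIVE.
        self.cause = cause & 0xFF  # assumed 8-bit CAUSE field
        self.active = True

    def write_one_to_clear(self, active: bool = False, overflow: bool = False) -> None:
        """Model a software write carrying 1s in the ACTIVE and/or OVERFLOW bits."""
        if active:
            self.active = False
            # Per the bug3343 note, clearing ACTIVE also clears OVERFLOW.
            self.overflow = False
        if overflow:
            self.overflow = False
        # CAUSE is left as it was in either case.


icr = InterruptCauseRegister()
icr.deliver(0x12)
icr.deliver(0x34)  # arrives while ACTIVE is still set, so only OVERFLOW changes
```

After the second delivery the model still holds the first cause value, which is why software should check OVERFLOW before trusting that no interrupt was lost.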


7.18.7 Interrupt Delivery Register 
Register 


R_CacxIntDel 


Attributes 


-kernel 


Address 


0x00_1000 


Reserved
15:12 | DEST | W | 0 | Bus stop number of target segment


11:8 | ICRIDX | W | 0 | Index into target segment’s ICR set.
7:0 | CAUSE | W | 0 | Interrupt Cause


7.18.8 Slow Interrupt Selection Register 
Register 


R_CacxSlIntSel 


Attributes 


-kernel 


Address 


0x00_00C8 (plus base address) 
12 | CswUnCorEccIntEn | RW | Uncorrectable CSW ECC Interrupt is passed on to processor
11 | CswCorEccIntEn | RW | Correctable CSW ECC Interrupt is passed on to processor
10 | L2UnCorEccIntEn | RW | Uncorrectable L2 ECC Interrupt is passed on to processor
L2CorEccIntEn | RW | Correctable L2 ECC Interrupt is passed on to processor
LACSIntEn | Assertion of LAC (OCLA) Slow Interrupt is passed on to processor
PMISlIntEn | Assertion of PMI Slow Interrupt is passed on to processor INT[3] IRQ5
SCBSlIntEn | RW | Assertion of SCB Slow Interrupt is passed on to processor INT[3] IRQ5
FLSIntEn | RW | Assertion of Fabric Link Transceiver Interrupt is passed on to processor INT[3] IRQ5
DMASlIntEn | Assertion of DMA Slow Interrupt is passed on to processor INT[3] IRQ5
FSWSlIntEn | Assertion of FSW Interrupt is passed on to processor INT[3] IRQ5
Assertion of UART Interrupt is passed on to processor
COHESlIntEn | RW | Assertion of COHE Interrupt is passed on to processor INT[3] IRQ5. COHE asserts this interrupt on occurrence of an ECC error or DDR Calibration Timeout.
COHOSlIntEn | RW | Assertion of COHO Interrupt is passed on to processor INT[3] IRQ5. COHO asserts this interrupt on occurrence of an ECC error or DDR Calibration Timeout.


7.18.9 Slow Interrupt Status Register 


For more details, see the “Interrupts, Again” section of the Processor Segments chapter, (section 6.19.6). 


Register 


R_CacxSlIntStat 


Attributes 


-kernel 


Address 


0x00_00D0 (plus base address) 
12 | CswUnCorEcc | RW1C | 0 | Uncorrectable ECC detected on transfer from CSW
11 | CswCorEcc | RW1C | 0 | Correctable ECC detected on transfer from CSW
10 | L2UnCorEcc | RW1C | 0 | Uncorrectable ECC detected on transfer from L2 Cache
9 | L2CorEcc | RW1C | 0 | Correctable ECC detected on transfer from L2 Cache
8 | LACSInt | R | 0 | LAC (OCLA) Slow Interrupt asserted
PCI/PMI Slow Interrupt asserted
SCB Slow Interrupt asserted
Fabric Link Transceiver Interrupt asserted
DMASlInt | R | 0 | DMA Slow Interrupt asserted
3 | FSWSlInt | R | FSW interrupt is asserted
COHO Interrupt is asserted


7.18.10 L2 Cache ECC Mode Register 
Register 


R_CacxEccMode 


Attributes 


-kernel 


Address 


0x00_0100 (plus base address) 


5 | L2TagDetEna | RW | 0 | Enable ECC Error Detection on L2 tag accesses
4 | L2TagCorEna | RW | 0 | Enable ECC Error Correction on L2 tag accesses
3 | CswDetEna | RW | 0 | Enable ECC Error Detection on CSW transfers
2 | CswCorEna | RW | 0 | Enable ECC Error Correction on CSW transfers
1 | L2DetEna | RW | 0 | Enable ECC Error Detection on L2 transfers
0 | L2CorEna | RW | 0 | Enable ECC Error Correction on L2 transfers


7.18.11 L2 Cache ECC Test Register 
Register 


R_CacxEccTestDat 


Attributes 


-noregtestcpu_wr -kernel 


Address 


0x00_0108 (plus base address) 
L2DrvBadTag1 | Flip bit 1 of all future addresses written to the L2 Tag array
L2DrvBadTag0 | Flip bit 0 of all future addresses written to the L2 Tag array
CswDrvBadDat1 | RW | Flip bit 1 of all words written to the CSW via IO write or cache block
| 0 | Flip bit 0 of all words written to the CSW
1 | L2DrvBadDat1 | Flip bit 1 of all even words for all future 32 byte blocks written into the L2 data array from L1 writebacks.
L2DrvBadDat0 | Flip bit 0 of all even words for all future 32 byte blocks written into the L2 data array from L1 writebacks.


7.18.12 L2 Cache Status Register 


Register 
R_CacxEccStat 


Attributes 


-kernel 


Address 
0x00_0110 (plus base address)


Definition


L2TagMultErr | RW1C | Multiple ECC errors have occurred on an L2 tag lookup.
L2TagCorErr | RW1C | Correctable error detected on an L2 tag lookup. Write 1 to clear.
L2TagUncorErr | RW1C | Uncorrectable error detected on an L2 tag lookup. Write 1 to clear.
5 | CswMultErr | RW1C | Multiple ECC errors have occurred on a CSW transfer.
4 | CswCorErr | RW1C | Correctable error detected on a CSW transfer. Write 1 to clear.
3 | CswUncorErr | RW1C | Uncorrectable error detected on a CSW transfer. Write 1 to clear.
2 | L2MultErr | RW1C | Multiple ECC errors have occurred on an L2 transfer.
1 | L2CorErr | RW1C | Correctable error detected on an L2 transfer. Write 1 to clear.
0 | L2UncorErr | RW1C | Uncorrectable error detected on an L2 transfer. Write 1 to clear.


7.18.13 L2 Cache Data ECC Error Address Register 
This register gets loaded on the first ECC error signaled by either of the DATA array ECC checkers.


Register 
R_CacxL2EccAddr 


Attributes 


-kernel 
Address


0x00_0118 (plus base address)
ErrAddr | R | 0 | Address of word for first detected ECC error in L2 Cache


2:0 | | R | 0 | Reserved.


7.18.14 CSW ECC Error Address Register 

This register gets loaded on the first ECC error signaled by the CSW ECC checker. It is cleared when the 
corresponding correctable or uncorrectable error bit is cleared. 
Register 

R_CacxCswEccAddr 


Attributes 


-kernel 


Address 
0x00_0120 (plus base address) 


Definition
ErrAddr | R | 0 | Address of word for first detected ECC error from CSW transfer


2:0 | | R | Reserved

7.18.15 L2 Cache Tag ECC Error Address Register 


This register gets loaded on the first ECC error signaled by the Tag ECC checker. It is cleared when the 
corresponding correctable or uncorrectable error bit is cleared. 


Register 
R_CacxTagEccAddr 


Attributes 


-kernel 


Address 
0x00_0128 (plus base address) 


ErrAddr | R | 0 | Address of word for first detected ECC error from a Tag lookup


7.18.16 L2 Cache ECC Error Syndrome Register 


Each syndrome field is only meaningful if the corresponding correctable/uncorrectable error bit is set. 


Register 
R_CacxEccSynd 


Attributes 


-kernel 


May 14, 2014 451 Rev 51328 


SiCortex Confidential CHAPTER 7. L2 CACHE COHERENCE AND SWITCH 


Address 


0x00_0130 (plus base address) 


CswSyndHi | R | 0 | Syndrome from the high word of a CSW transfer.


CswSyndLo | R | 0 | Syndrome from the low word of a CSW transfer.


The syndrome is only captured for ECC errors from CSW transfers. (This gives us insight into which bits are
failing on DIMMs. This is more valuable than knowing which bits are failing in on-chip RAMs.) The register is
loaded on the FIRST detected CSW ECC error after the CorErr and UnCorErr bits have been cleared.


7.18.17 L2 Cache Send SPCL Request Address Range 


The SPCL addresses must span a range of 16 maximum size physical pages (64kB), so that each page can be 
mapped by the kernel into a separate user process. To send a SPCL, the program does a store instruction to an 
address in the SPCL request address range. The address of the store, and the data that is stored, are combined to 
produce the value that is driven onto the CSW Address bus along with the SPCL command. The CSW address 
encoding is described in detail in section 7.10.6. 

Note that these addresses must be on separate physical pages from all other local CAC control registers as these 
will be accessible from user mode programs. 


Register 
R_Spcl[0x3F_FFFF:0] 


Attributes 


-noregtest -kernel 


Address 
0xE_BE00_0000-0xE_BEFF_FFFC 


23:0 | SpclData | W | 0 | Data to be delivered to DMA engine via SPCL command.


7.18.18 Coherence Engine ECC Mode Register 


Register 
R_CohxEccMode 


Attributes 


-kernel 


Address 
0x00_0000 (plus base address) 


2 | DetDblEna | RW | Enable ECC Error Detection on tag lookups. When asserted,
any detected double bit error will trigger a slow
interrupt from this coherence widget. (See 7.18.8.)


1 | DetSnglEna | RW | Enable ECC Error Detection on tag lookups. When asserted,
any detected single bit error will trigger a slow
interrupt from this coherence widget. (See 7.18.8.)
0 | CorEna | RW | 0 | Enable ECC Error Correction on tag lookups


Programmer’s note: Bugzilla 1990 finds that the behavior of the COH when CorEna is clear 
could be unpredictable when an ECC error is detected in a master tag array. For this reason, the 
CorEna bit should always be set to 1 when the COH is in use. 
7.18.19 Coherence Engine ECC Test Register
Register 
R_CohxEccTestDat 


Attributes 


-kernel 


Address 


0x00_0018 (plus base address) 


1 | DrvBadDat1 | RW | 0 | Flip bit 1 of word 0 in any tag written into any tag array


0 | DrvBadDat0 | RW | 0 | Flip bit 0 of word 0 in any tag written into any tag array


7.18.20 Coherence Engine ECC Status Register 
Register 
R_CohxEccStat 


Attributes 


-kernel 


Address 
0x00_0020 (plus base address) 


2 | MultErr | RW1C | While either CorErr or UnCorErr was set, a subsequent
ECC (single or double) error was detected. Write 1 to
clear.


CorErr | RW1C | Correctable error detected on a tag lookup. Write 1 to
clear. If this bit and the DetEna bit in the CohxEccMode
register are both set, the Coh will send a slow interrupt
to each processor segment. One or more tag arrays may
have reported a single bit error in a given cycle.
clear. 


Note that MultErr is not asserted if two or more TAG arrays report an ECC error in the same cycle. MultErr
is only asserted if a new ECC error occurs while CorErr or UncorErr is already asserted.
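
The MultErr rule can be made concrete with a short behavioral sketch. The Python class below is hypothetical and is not derived from the COH sources. It assumes that a cycle arriving while CorErr or UncorErr is already set changes only MultErr, and that a first-error cycle with both single and double bit reports sets both CorErr and UncorErr. Neither assumption is stated in the text.

```python
from dataclasses import dataclass


@dataclass
class CohEccStatus:
    """Behavioral model of the CorErr, UncorErr and MultErr status bits."""
    cor_err: bool = False
    uncor_err: bool = False
    mult_err: bool = False

    def report_cycle(self, single_bit_arrays: int = 0, double_bit_arrays: int = 0) -> None:
        """Record one cycle; the arguments count tag arrays reporting each error type."""
        if single_bit_arrays == 0 and double_bit_arrays == 0:
            return
        if self.cor_err or self.uncor_err:
            # A new error while an earlier one is still latched: MultErr is set.
            self.mult_err = True
            return
        # First error: several arrays in the same cycle do NOT set MultErr.
        if single_bit_arrays:
            self.cor_err = True
        if double_bit_arrays:
            self.uncor_err = True

    def write_one_to_clear(self, cor: bool = False, uncor: bool = False, mult: bool = False) -> None:
        """Model RW1C writes to the three status bits."""
        if cor:
            self.cor_err = False
        if uncor:
            self.uncor_err = False
        if mult:
            self.mult_err = False


status = CohEccStatus()
status.report_cycle(single_bit_arrays=3)  # three arrays, one cycle: no MultErr
first_cycle_mult = status.mult_err
status.report_cycle(double_bit_arrays=1)  # later cycle with CorErr still set
```

The two calls show the distinction the note draws: simultaneous reports from several arrays count as one event, while a later report on top of an uncleared error counts as multiple.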


7.18.21 Coherence Engine ECC Error Address Register 


This register gets loaded on the first ECC error signaled by the CSW ECC checker. It is updated only if CorErr 
and UncorErr are both clear. 


Register 
R_CohxEccAddr 


Attributes 


-kernel 
Address
0x00_0028 (plus base address)


Definition


ErrAddr | Address <34:7> of Block for first detected ECC tag
lookup error


2:0 | Array | Which tag array had the problem? If multiple arrays reported
an error, the lowest numbered array is reported
here.


7.18.22 Twice9+ Coherence Engine ECC Error Address Register 


For Twice9+ this new 64 bit SCB register gets loaded on the first ECC error signaled by the CSW ECC checker. 
It is updated only if CorErr and UncorErr are both clear. 


Register 
R_CohxEccAddrTwcePlus 


Attributes 
-kernel -Product=TWC9A+ 


Address 
0x00_0050 (plus base address) 


47:7 | ErrAddr R TWC9A+ | Address <47:7> of block for first detected tag lookup 
ECC error. 
Some number of MSB bits are padded with zeros depend- 
ing on the design revision. 


Array R TWC9A+ | Identifies which tag array had the problem. If multiple 
arrays reported an error, the lowest numbered array is 
reported here. MSBs are padded with zeros depending on 
the number of tag arrays in the specific design revision. 


7.18.23 Coherence Engine ECC Error Syndrome Register 


Register 
R_CohxEccSynd 


Attributes 


-kernel 


Address 
0x00_0040 (plus base address) 


7:0 | ErrSyndrom | R Syndrome of first detected ECC error from Master Tag 
Lookup 


7.18.24 Coherence Engine Active Processor Segment Register 
Register 
R_CohxNumSegs 
Address
0x00_0048 (plus base address)


Definition
6:3 | ActiveSegCountTwe | RW | 10 | TWC9A+ | Number of L2 Segments currently enabled for operation. Must be either 1 or 10.
2:0 | ActiveSegCount | RW | Number of L2 Segments currently enabled for operation. Must be either 1 or 6.


The NumSegs register allows the chip to be configured as a uniprocessor, if necessary. The value in this register 
must be set prior to initial program load. The value from this register is loaded into the appropriate INVDONE 
counter whenever the COH sends out a PRBINV request on behalf of a processor or PMI device. (A transaction 
that causes a PRBINV is not complete until all active L2 segments have sent an INVDONE signal to the appropriate 
COH. See section 7.12.3.5.) 
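
The way the NumSegs value feeds an INVDONE counter can be sketched as follows. This Python model is an illustration under stated assumptions, not the COH implementation: the class name, method names, and the error handling are hypothetical. The legal segment counts (1 or 6 for ICE9, 1 or 10 for TWC9A+) come from the register definition above.

```python
class InvDoneCounter:
    """Counts INVDONE signals outstanding for one PRBINV-causing transaction."""

    def __init__(self, active_seg_count: int, legal_counts=(1, 6)):
        # legal_counts defaults to the ICE9 values; pass (1, 10) for TWC9A+.
        if active_seg_count not in legal_counts:
            raise ValueError(f"ActiveSegCount must be one of {legal_counts}")
        self.active_seg_count = active_seg_count
        self.remaining = 0

    def send_prbinv(self) -> None:
        """Load the counter when the COH sends out a PRBINV request."""
        self.remaining = self.active_seg_count

    def receive_invdone(self) -> bool:
        """Consume one INVDONE; return True once every active segment has replied."""
        if self.remaining == 0:
            raise RuntimeError("INVDONE received with no PRBINV outstanding")
        self.remaining -= 1
        return self.remaining == 0


counter = InvDoneCounter(active_seg_count=6)
counter.send_prbinv()
completions = [counter.receive_invdone() for _ in range(6)]
```

The sketch also shows why the register must match the number of segments actually enabled: if the value were too high, the last INVDONE would never arrive and the transaction would not complete.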


7.19 Register Allocation 


This section instantiates the six copies, plus the local copy, of the CAC registers. It also instantiates the two sets
of COH control registers. 


7.19.1 CacLoc 
Register 
R_CacLoc* : R_Cacx*


Address
0xE_9E00_0000-0xE_9EFF_FFFF


7.19.2 Coho 
Register 
R_Coho* : R_Cohx*


Address
0xE_0000_0000-0xE_00FF_FFFF


7.19.3 Cohe 
Register 
R_Cohe* : R_Cohx*


Address
0xE_0900_0000-0xE_09FF_FFFF




Chapter 8 


Memory Controller 


[Last modified $Id: memctl.lyx 50693 2008-02-07 16:01:46Z wsnyder $] 


8.1 Overview 


The ICE9 chip has two built-in memory controllers, each of which interfaces to one 1-GB, 2-GB, 4-GB, or 8-GB 
72-bit DDR2 SDRAM DIMM. The chip accommodates memory clock rates of 267, 333, and 400 MHz, corresponding
to data rates of 533, 667, and 800 MHz, respectively. 

The memory controller functionality is partitioned across two functional units, DDR and DDP. The DDP unit
contains the DDR2-PHY, which is implemented as a hard IP macro (purchased from eSilicon). The DDR Unit is
composed of the following subsections: 


1. DDI - Interface block between the DDR2 Controller (DDC) and the Coherence Controller (COH). Designed 
by SiCortex. 


2. DDC - DDR2 SDRAM controller IP logic block. Purchased source code from Northwest Logic and synthesized. 
3. DDD - Read datapath interface to DDR2 PHY 


The two instances of the DDR unit are referred to as the “even” and “odd” DDR units. The “even” DDR instance is
ddre (sometimes called ddr0), while the “odd” instance is ddro (sometimes referred to as ddr1). The even instance
is on the east side of the die. The instances are distinguished by the static input pin tie_ddrx_id (for ddre/ddr0 it is
tied to 1’b0 (GND), while for ddro/ddr1 it is tied to 1’b1 (VDD)). The two instances of the DDP unit are similarly
named; however, there is no need for a static signal to distinguish them since a given instance of DDP does not need
to know whether it is the “even” or “odd” instantiation.


8.2 Differences, Bugs, and Enhancements 


8.2.1 Product and Chip Pass Differences 


1. ICE9B fixes the DDR unit to support IO driver calibration before the DRAM initialization sequence, bug2276. 
In ICE9A the Ddr/Ddp units currently only support updating values into the IMP_P_HV[3:0] and IMP_N_HV[3:0]
inputs of the DDR2 IO cells during one of the mission mode time CalModes. When SoftReset is asserted the
PHY puts default strong values (low impedance biased) into these.


2. ICE9B fixes some of the ODT on/off range values, bug2401. The NWL controller was supposed to support
the following range of ODT turn on/off times for Ice9a’s DDR-Phy:
ON time range: controlled by DdrxPhyCfg2_AsicDqsOdtOn and DdrxPhyCfg2_AsicDqOdtOn, -2.5 clocks <-> 0 clocks
(in half cycle increments) relative to the start of the read preamble.
OFF time range: controlled by DdrxPhyCfg2_AsicDqsOdtOff and DdrxPhyCfg2_AsicDqOdtOff, -1.5 clocks <-> 2 clocks
(in half cycle increments) relative to the start of the read preamble.
However, the bug causes the -2.5 and -2 clocks turn on times to NOT work with turn off
times of 1.5 and 2 clocks.


3. TWC9A fixes access to any SCB bus slave hanging while the DDR controller is in reset, bug2928.
4. NEED IMPL: TWC9A drops support for unbuffered DIMMs.
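
The ODT restriction in item 2 (bug2401) can be enumerated with a short sketch. This is an illustrative Python calculation, not driver code: the helper names are hypothetical, and the sketch models only the broken on/off pairings named in the text. It does not model any other constraint the controller may place on ODT timing.

```python
def half_cycle_steps(low: float, high: float) -> list:
    """List the values from low to high inclusive, in half cycle increments."""
    count = int(round((high - low) / 0.5))
    return [low + 0.5 * i for i in range(count + 1)]


# Ranges relative to the start of the read preamble, in clocks (from item 2).
ODT_ON_TIMES = half_cycle_steps(-2.5, 0.0)
ODT_OFF_TIMES = half_cycle_steps(-1.5, 2.0)


def odt_pair_works_on_ice9a(on_time: float, off_time: float) -> bool:
    """Return False for the on/off pairings that bug2401 breaks on ICE9A."""
    if on_time not in ODT_ON_TIMES or off_time not in ODT_OFF_TIMES:
        raise ValueError("time is outside the supported range")
    return not (on_time in (-2.5, -2.0) and off_time in (1.5, 2.0))


working_pairs = [
    (on_time, off_time)
    for on_time in ODT_ON_TIMES
    for off_time in ODT_OFF_TIMES
    if odt_pair_works_on_ice9a(on_time, off_time)
]
```

Of the 48 nominal pairings, only the four that combine the two earliest turn on times with the two latest turn off times are excluded.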


8.2.2. Known Bugs and Possible Enhancements 


1. Calibration Mode 2 can cause Ddi to hang waiting for Powerdown, see bug2013. When setting AutoCalUpdate 
in cal mode 2 (update during prechargePowerdown), the Ddi can hang. This is caused when a request is at the 
head of the queue requesting to be sent to the controller at the time we start the calibration update process. 
The calibration logic spins in place waiting for powerdown entry. However, this pending request causes the 
powerdown counter to be cleared on every cycle, which blocks the Ddr from ever entering powerdown mode. 
To work around this, do not use calibration mode 2.


2. The DDR bank address could be changed to better optimize page hits, bug2068. 


8.3 General Description 


8.3.1 Clocks 


The memory interface has two clock domains: CCLK and DCLK. CCLK is the same clock used on the core
side of DDR (COH and CSW units). DCLK is the same clock used by DDC, the
DDR2-PHY, and the DDR2 SDRAM DIMMs. (Note that some of the logic really runs off of DM90CLK, which is a
minus 90 degree shifted version of DCLK.)

The required relationship between the clocks is:

CCLK <= DCLK < (2 * CCLK)/1.05 

Note that since DCLK (or DM90CLK) is also used for driving clocks to the DIMM and the PHY’s DLLs, it has
the additional restriction that 125MHz <= DCLK <= 465MHz (125MHz correlates to the maximum tCK cycle time
supported by target DIMMs and 465MHz is the maximum frequency supported by the True Circuits analog DLLs 
used in the PHY). 
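
The two clock constraints above can be checked together with a few lines of code. The following Python sketch is illustrative only: the function name clocks_are_valid is hypothetical, and frequencies are taken in MHz. The inequalities themselves are the ones stated in this section.

```python
def clocks_are_valid(cclk_mhz: float, dclk_mhz: float) -> bool:
    """Check a CCLK/DCLK pair against the constraints stated in this section."""
    # Required relationship: CCLK <= DCLK < (2 * CCLK) / 1.05
    ratio_ok = cclk_mhz <= dclk_mhz < (2 * cclk_mhz) / 1.05
    # DIMM tCK limit and PHY DLL limit: 125 MHz <= DCLK <= 465 MHz
    range_ok = 125 <= dclk_mhz <= 465
    return ratio_ok and range_ok


ok_pair = clocks_are_valid(cclk_mhz=300, dclk_mhz=400)
cclk_too_slow = clocks_are_valid(cclk_mhz=200, dclk_mhz=400)
dclk_too_slow = clocks_are_valid(cclk_mhz=100, dclk_mhz=120)
```

Rearranging the first inequality gives 0.525 * DCLK < CCLK <= DCLK, which is the basis for the lower CCLK limits of roughly 140, 175, and 210 MHz at the three supported DCLK rates.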


Table 8.1: Recommended DCLK to CCLK relationships 


DCLK | CCLK
267 MHz | 140 MHz - 267 MHz
333 MHz | 175 MHz - 333 MHz
400 MHz | 210 MHz - 400 MHz


Note that the Analog Bits PLL used on the ASIC drives out two clocks at DCLK frequency: PLLOUT_1 and
PLLOUT_2, which is shifted positive 90 degrees relative to PLLOUT_1. Thus DCLK must be tied to PLLOUT_2
and DM90CLK tied to PLLOUT_1 in order to achieve the desired minus 90 degree shift.


8.3.2 Reset and Initialization 


Startup sequence for the DDR interface to come up correctly (which will cause R_DdrxDdcDdpSoftReset to
assert):

1. At startup, power will be brought up for the ICE9 and for the DIMMs (in accordance with JEDEC standard 
JESD79-2B 2.3.1a (page 9)). 

2. Global reset will be asserted from before the start of power-up and kept asserted during power-up. (This is
to address the JEDEC mandate of attempting to maintain CKE below 0.2*VDDQ and ODT at a low state during
power-up; they are asynchronously pulled low when reset is asserted.)

3. The dclk resets (reset_elder_l and reset_eldor_l) must remain asserted for at least 1us after the power ramp
has been completed. (This is a requirement of the analog DLLs used in the DDR2-PHY.)

4. After all clocks are appropriately configured and stable (at least those relevant to memory operation: cclk,
d0clk, d1clk, d0m90clk, d1m90clk), deassert the cclk and dclk resets.

HERE NEED TO ADD CONFIGURING OF CK IO DRIVE STRENGTH THEN RELEASE THE RESET 
FOR THE CLOCK FLOPS 

5. The deassertion of the dclk resets will cause clocks to be driven to the DIMMs (JEDEC requires a min of
200us of stable clock, some or all of which can be satisfied in the shadow of steps 6 -> 11, which would reduce the
delay value required by R_DdrxDdcMemCfg3_Delay).

Note 5-1: The initial value of R_DdrxDdcDdpSoftReset will keep the memory controller and DDR2-PHY IP
blocks (DDC/DDP) in reset.

Note 5-2: The initial value of R_DdrxDdiMemLoopBack will be such that any memory references received by
the DDR units will be looped back such that they receive completion notification.

6. Write a 0 to R_DdrxDdpDLLReset to deassert reset to the PHY DLLs.

Note 6-1: The minimum total assertion time of R_DdrxDdpDLLReset is 1us after the power ramp completes
(clocks need to be stable for at least a few cycles before this reset is deasserted).

Note 6-2: After R_DdrxDdpDLLReset is deasserted no reads can go out to memory for 500 cycles while the
DLLs are possibly unlocked.
DLLs are possibly unlocked. 

7. Based on data obtained from the DIMMs Serial Presence Detect through the on die I2C Master Controller 
and from data on the DIMM configuration provided via the Module Service Processor (MSP), the boot pro- 
cessor will then write the CSR registers R-DdrxDdcMemCfg1-5, R-DdrxDdcDIMMODT, R_DdrxDdpODT, and 
R_DdrxDIMMSize via the SCB bus. The boot processor may also write the registers R-DdrxDdiMifCfg1-2 and 
R_DdrxDddRdDelay, otherwise the defaults will be used (R-DdrxDdiMifCfg1-2 can be modified via the SCB at 
runtime also). 

8. Write appropriate values to R-DdrxPhyCfg1-3 and R-DdrxDddRdDelay if the defaults prove inadequate. 

9. The values of R-DdrxDdpDLLLane0-8 will need to be set. This step can be satisfied with known good values 
or some values which be adjusted as decribed in the section below “PHY Read Path DLL Calibration”. 

10. The boot processor will them write a 0 to R-DdrxDdcDdpSoftReset to deassert the soft reset to DDC and 
DDP. 

11. After the boot processor has insured that there are no outstanding read or write requests (i.e. no 
TIDs are in flight (this may involve some sequence of memory ordering directives)), it will then write a 0 to 
R_DdrxDdiMemLoopBack. 

12. Once the DDC / DDP soft reset is deasserted, DDC will begin issuing an initialization sequence compliant 
with the JEDEC standard, and DDI will begin queuing up read and write requests. 


13. Issue a memory test sequence (note that failure of the memory test must not be considered a fatal startup
error, since that would block the testing needed to calibrate the PHY DLLs).

14. Clear memory (write 0s to all locations).


8.3.3 Serial Presence Detect 


DDR2 SDRAM memory DIMMs interfacing to ICE9 must implement Serial Presence Detect in accordance with
JEDEC Standard No. 21-C. Details discussed herein (in particular the SPD byte number address mapping) reference
the preliminary publication of “Appendix X: Serial Presence Detects for DDR2-SDRAM” (Revision 1.2).

On the board, the even side DIMM (on the east side of the chip and interfacing to ddre / ddr0) will be hard 
coded with its SDA[2:0] inputs tied to 000, resulting in an I2C address of 0x50, while the odd side DIMM will have 
its SDA[2:0] inputs tied to 001, resulting in an I2C address of 0x51. 
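
The address arithmetic above can be sketched as follows. This is a minimal illustration, assuming the usual SPD EEPROM convention in which the three strapped select inputs form the low bits of a 7-bit I2C address whose base is 0x50; the function and constant names are hypothetical and not part of the ICE9 software.

```python
SPD_BASE_ADDR = 0x50  # 7-bit I2C base address implied by the 0x50/0x51 values in the text


def spd_i2c_address(select_pins: int) -> int:
    """Return the 7-bit I2C address for a DIMM whose three select inputs are strapped to select_pins."""
    if not 0 <= select_pins <= 0b111:
        raise ValueError("the select inputs form a 3-bit value")
    return SPD_BASE_ADDR | select_pins


EVEN_DIMM_ADDR = spd_i2c_address(0b000)  # even side DIMM
ODD_DIMM_ADDR = spd_i2c_address(0b001)   # odd side DIMM
```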


8.3.4 PHY Read Path DLL Calibration 


For detailed structural information on the DLL used in the PHY, see the corresponding subsection of the “DDP 
Unit - DDR2 SDRAM PHY IP Block” section of this specification. This section describes the process for software 
to figure out optimal DLL settings for each of the 9 byte lanes of each DDR interface. The process is for software 
to sweep through DLL settings, doing a read with each value, to figure out an eye window. The center of the eyes 
will point to the best DLL settings. There are a number of issues which need to be addressed with software and 
hardware support: 

1. The processor running the software doesn’t see the ECC. To address this, the hardware includes CSRs which
capture the ECC value of data transferred in association with read requests to memory.

2. An incorrect DLL setting can result in the PHY not returning any read data. To deal with this, the hardware
has a mode (controlled with R_DdrxDdiRdTimeOutAutoComplete) to prevent hanging. Some clean up of internal
state is required before the next read access attempt (controlled with R_DdrxDdiRdPathRst).

3. An incorrect DLL setting can result in the PHY returning incomplete read data. R_DdrxDdiRdPathRst is
used in between read attempts to ensure the read datapath is returned to a known good state before attempting
the next read.


8.3.4.1 Overview of DLL calibration process


Since each byte lane has two DLLs, the basic idea is to fix the DLL setting for one of the DLLs (referred to
as the reference DLL) and do a number of reads as the other DLL is swept across a range which is expected to
include its eye. Then change the reference DLL and again do reads while sweeping the other DLL. Since each DLL
has 160 steps, it would take a lot of reads to sweep the entire space. We can reduce the search window because we
know that Slave1 will need to be close to 1/4 of the reference cycle. Based on the analysis provided in the DLL
subsection of the DDP Unit section of this specification, it is recommended that Slave1 be used as the reference
DLL, and it should sweep from 0 to 38. The Slave 0 DLL needs to sweep a range which covers the min to max
trace length delay for byte lanes. The Slave 0 sweep range is recommended to be 1-134.


8.3.4.2 DLL Calibration flow


Suggested DLL calibration flow (note that these steps need to be executed for both DDR interfaces. Total
calibration time can be reduced by doing them in parallel, but care should be taken to ensure they don’t alias to
the same address in any of the cache levels):


1. Go through the reset and initialization sequence as discussed above.

2. Set R_DdrxDdiECCCaptureEnable_EnableRdECCCapture.

3. Set the R_DdrxDdiRdTimeOutAutoComplete_Enable CSR to enable auto completion of reads that hang.

4. Issue a write of a signature pattern such that the write is pushed all the way to DRAM.


Note 4-1: The signature should be chosen carefully so that each of the 9 byte lanes receives unique data over
the 8 bursts of the read returned from the DIMM. Note especially that we want the 8 bursts of the ECC to be unique
as well, so the pattern across each 8B chunk should factor that in.


5. Wait for the write to complete (TID is released). 
6. Issue a read to the same address as the previous write, such that the read is issued all the way to DRAM. 


7. A few cycles after the read data is driven onto the CSW bus, copies of the ECC bits are written into the 
CSRs R_DdrxDdiRdECCCapture0-1. 


8. Wait for read data to return to the processor. 


9. Compare the read data with the expected value. Use the SCB bus to access R_DdrxDdiRdECCCapture0-1.
A byte lane must compare correctly for all of its 8 transfer bursts.


10. Check R_DdrxDdiRdTimeOutAutoComplete_RdHang; if it is set, interpret this to mean that all the
byte lanes failed for the given set of DLL settings.


11. Based on the results of steps 9 and 10, log the success/failure result for each of the 9 byte lanes.


12. Write new values to R_DdrxDdpDLLLane0-8_Slave0Adj and possibly R_DdrxDdpDLLLane0-8_Slave1Adj
(according to the DLL spec it takes “a couple of cycles” for the DLL to operate glitch free at the new settings; the
time to execute steps 13-15 should more than account for this).


13. Write a 1 to R_DdrxDdiRdPathRst.

14. Write a 0 to R_DdrxDdiRdPathRst.

15. Clear R_DdrxDdiRdTimeOutAutoComplete_RdHang (it is W1C).

16. Loop back to step 6. 
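
Once the pass/fail results of step 11 have been logged for one byte lane across a sweep, the eye and its center can be extracted in software. The sketch below is one possible way to do this, not the method used by the ICE9 boot code: it assumes the results are held as a list of booleans, one per swept setting, and it picks the widest contiguous run of passes. The function name and arguments are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def find_eye(passed: Sequence[bool], first_setting: int = 1) -> Optional[Tuple[int, int, int]]:
    """Return (low, high, center) of the widest run of passing DLL settings, or None if none passed.

    passed[i] is the logged result for DLL setting first_setting + i.
    """
    best = None
    run_start = None
    # A trailing False acts as a sentinel so a run that reaches the end of the sweep is closed.
    for i, ok in enumerate(list(passed) + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            low = run_start + first_setting
            high = i - 1 + first_setting
            if best is None or (high - low) > (best[1] - best[0]):
                best = (low, high, (low + high) // 2)
            run_start = None
    return best
```

For example, with the recommended Slave 0 sweep range of 1-134, a lane that passes only for settings 40 through 60 yields an eye of (40, 60) with center 50. Taking the widest run keeps an isolated lucky pass outside the eye from skewing the result.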


8.3.5 DIMM Requirements 


1. 240 Pin DDR2 SDRAM Unbuffered or Registered DIMM. 

2. x72 DIMM (72 total data pins: 64 data plus 8 check bits; referred to as ECC DIMMs).

3. DRAM chips on the DIMM are x8 chips (9 on single rank DIMMs, 18 on dual rank DIMMs), and are one of 
the following sizes: 512 Mb, 1 Gb, 2 Gb, or 4 Gb. Note that this implies the chips have 4 or 8 banks (2 or 3 bank 
address bits) and 10 column address bits. 

4. Transfer rate requirement: 266, 333, or 400 MHz tCK. Note 266 MHz may not be supported in systems 
where CCLK is faster than 266 MHz.


Table 8.2: Supported memory configurations per DDR interface (half of the total main memory connected to each
ICE9 chip).

Note that 4 rank configurations are not targeted because the DDR2-PHY is not designed to operate at full speed
with the loading of a 4 rank configuration.

DIMM Configuration | DRAM chips | Target Configuration
1GB (2 rank) * | 18-512Mb (64Mx8) chips |
 | 18-2Gb (256Mx8) chips |
 | 36-1Gb (128Mx8) chips |

* Note that this configuration requires setting R_DdrxDdcMemCfg3_Bankbits = 0


8.3.6 Addressing 


The ICE9 chip has a 64GB address space, 32GB of which is for main memory (cacheable). Each instance of
the DDR unit can interface with up to 16GB of memory (the 16GB is logically possible based on the functionality
of the design; however, the target maximum is 8GB because of physical design issues and the expectation that the
largest DIMMs available in configurations of 2 or fewer ranks will be 8GB DIMMs in the foreseeable future). Because
of the 64GB address space the address bus has 36 bits (35:0); however, the DDR unit drops bits for the following
reasons:

1. Bit 35 is dropped because it is always 0 for main memory references. 

2. Bit 6 is dropped because it is used to decide which DDR interface a request goes to
(i.e. it is always fixed for a given interface).

3. Bits 2:0 are not used because byte addressable requests are not supported by DDR2. 

So for example the incoming address coh_ddr_RdAddr_c2a[35:0] becomes addr[34:7,5:3] => addr[33:3]. Addr[33:3] 
is the format used within the DDR unit. 
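
The bit dropping described above can be sketched in software as follows. This is an illustrative model only; the function names are hypothetical, and which value of bit 6 selects which DDR interface is not stated here, so the helper simply returns the bit.

```python
def ddr_interface_select(addr: int) -> int:
    """Return bit 6 of the incoming 36-bit address, which decides the DDR interface."""
    return (addr >> 6) & 1


def ddr_internal_address(addr: int) -> int:
    """Map an incoming address[35:0] to the DDR-internal addr[33:3] format.

    Bit 35 (always 0 for main memory), bit 6 (interface select) and bits 2:0 are dropped.
    """
    if addr >> 35:
        raise ValueError("bit 35 must be 0 for main memory references")
    upper = (addr >> 7) & ((1 << 28) - 1)  # incoming bits 34:7 become internal bits 33:6
    lower = addr & 0b111000                # incoming bits 5:3 keep their position
    return (upper << 6) | lower
```

Two incoming addresses that differ only in bit 6 map to the same internal address, one on each DDR interface.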

The DDR section handles 64B memory references (including ECC they are 72B requests). Reads presented to
the DDR unit are required to be full 64B reads. For read requests, it returns the requested quadword (QW)
(128 bits + 16 bits ECC) first, according to Table 6.1. The read address presented to memory is QW aligned (i.e. address[3]
is always driven LOW on the address sent to DDC). The only supported write transaction sizes are 64B and 32B.
Write requests for 64B blocks must be aligned such that the starting address is 000 (the starting address is specified
by bits [5:3] of the incoming address). 32B writes are converted into 64B memory writes with the byte mask bits
driven “low” to prevent updating memory for the invalid part of the transfer. Write requests for 32B blocks must
be aligned such that the starting address[5:3] is 000 or 100.


Table 8.3: Data Transfer Order


Address[7:0] | Address[5:4] | Order of doublewords (DWs) out of DIMM | Data on CSW {Data0, Data1, Data2, Data3, Data4, Data5, Data6,

0x00 or 0x08 | 00


Table 8.5: Types of Memory writes:


Given that the order of write data arriving at DDR from CSW is:
{Data0, Data1, Data2, Data3, Data4, Data5, Data6, Data7}.
“None” => write mask bits are deasserted so that data is not overwritten in main memory.


coh_ddr_WrHalfMask_c4a | coh_ddr_WrAddr_c4a[5] | Order of data sent out to memory
`E_CohHalfMask_W64 | | {Data0, Data1, Data2, Data3, Data4, Data5, Data6, Data7}
`E_CohHalfMask_W64 | | {Data4, Data5, Data6, Data7, Data0, Data1, Data2, Data3}
`E_CohHalfMask_L32 | == 0 | {Data0, Data1, Data2, Data3, None, None, None, None}
`E_CohHalfMask_L32 | | {None, None, None, None, Data0, Data1, Data2, Data3}
`E_CohHalfMask_H32 | | {Data4, Data5, Data6, Data7, None, None, None, None}
`E_CohHalfMask_H32 | | {None, None, None, None, Data4, Data5, Data6, Data7}


8.3.7 Interface Between DDR and the Coherence Controller (COH)


Table 8.7: COH/DDR Interface


coh_ddr_RdValid_c2a | Asserted to signify a read request is being sent from COH to DDR
coh_ddr_RdAddr_c2a[35:3]
coh_ddr_RdTID_c2a[4:0]
coh_ddr_RaWShootDown_c3a
coh_ddr_RdShootDown_c4a
coh_ddr_WrValid_c4a
coh_ddr_WrHalfMask_c4a[1:0] | See “Table 2: Types of Memory Writes” for a description of how the half mask is used. Qualified by coh_ddr_WrValid_c4a.
coh_ddr_WrAddr_c4a[35:3]
coh_ddr_WrTID_c4a[4:0]
coh_ddr_Data0_c4a[71:0]
coh_ddr_Data7_c7a[71:0]
ddr_coh_WrTIDVal_c5a | Asserted when a write has been completed (safe to reuse the TID)
ddr_coh_WrTID_c5a | TID of a completed write request. Qualified by ddr_coh_WrTIDVal_c5a
ddr_coh_BackPressure_c5a | Asserted if DDR can’t accept any more requests
ddr_coh_DataValid_c2a | Asserted when a read is returning data
ddr_coh_DataTarget_c2a[8:0] | CSW target vector corresponding to read data return. Qualified by ddr_coh_DataValid_c2a
ddr_coh_RdShotDown_c2a
ddr_coh_DataTID_c2a[4:0] | Contains the TID for either:
    1. Read data returning, qualified by ddr_coh_DataValid_c2a
    2. Read which was shot down, qualified by ddr_coh_RdShotDown_c2a
ddr_coh_Data5_c4a[71:0]
ddr_coh_Data6_c5a[71:0]
ddr_coh_Data7_c5a[71:0]


8.4 DDI Section


8.4.1 Overview


The DDI block is the interface between the Coherence Controller (COH) and DDR2 Controller (DDC). The
DDI accepts requests (read and write commands) from the COH and issues them to the DDC. DDI has two clock
domains, the CCLK which interfaces with the COH, and the DCLK domain which interfaces with the DDC. All
clock domain crossings are done using


[Timing waveform, 0 to 25 ns: coh_ddr_WrValid_c4a, coh_ddr_WrTID_c4a[4:0], coh_ddr_WrAddr_c4a[35:3],
coh_ddr_WrHalfMask_c4a[1:0], the coh_ddr_Data0 through coh_ddr_Data7 write data signals,
ddr_coh_WrTIDVal_c5a, and ddr_coh_WrTID_c5a[4:0].]

Figure 1: Write request and completion (note that there is not a fixed time between request and completion).


[Timing waveform, 0 to 25 ns: coh_ddr_RdValid_c2a, coh_ddr_RdTID_c2a[4:0], coh_ddr_RdAddr_c2a[35:3],
coh_ddr_RaWShootDown_c3a, coh_ddr_RdShootDown_c4a, ddr_coh_DataValid_c2a, ddr_coh_RdShotDown_c2a,
ddr_coh_DataTarget_c2a, ddr_coh_DataTID_c2a[4:0], and the ddr_coh_Data0 through ddr_coh_Data7 read data signals.]

Figure 2: Read request with normal data return.
(Note that this figure does not illustrate the accurate delay
between the arrival of the Rd request and the data
return.)


[Timing waveform, 0 to 25 ns: coh_ddr_RdValid_c2a, coh_ddr_RdTID_c2a[4:0], coh_ddr_RdAddr_c2a[35:3],
coh_ddr_RaWShootDown_c3a, coh_ddr_RdShootDown_c4a, ddr_coh_DataValid_c2a, ddr_coh_RdShotDown_c2a,
and ddr_coh_DataTID_c2a[4:0].]

Figure 2: Read request which is shot down.
(Note 1. There is not a fixed time between the Rd request and the
RdShotDown completion notification.
Note 2. The behavior is similar if coh_ddr_RdShootDown_c4a is asserted
or if both assert in their respective valid cycles.)


8.4.2 Request Path


DDI can accept one read and one write command every cycle, and is structured to handle up to a total of 20 
write requests and 20 read requests. Each request comes with an associated address, TID, and a valid signal. Write 
requests arrive coincident with the first cycle of the write data transfer. The request path for reads and writes 
are separate for most of DDI, allowing read and writes to pass each other (the COH prevents hazards). Incoming 
requests are flopped into a flop-based synchronizing fifo (one for reads and another for writes). Requests are read 
out of the input fifo on the DCLK and transferred to a bank fifo (based on the bank bits of the address). Since DDC
is designed to manage 8 banks of memory, DDI has 8 read bank fifos and 8 write bank fifos. The head entries of the
bank fifos arbitrate for access to DDC. Each cycle, a two-level arbiter selects a request to send to the DDC (if there
is a valid one). The first level has parallel arbiters (one for reads, and one for writes), each of which round-robins 
between the valid head entries of the 8 bank FIFOs. The second level chooses which wins. The grant algorithm 
gives preference to reads for a fixed number of consecutive grants, then to writes for a fixed number of consecutive 
grants (the ratio of reads to write grant preference is set through a configuration register). In any cycle where no 
reads or writes are bidding from any of the bank FIFOs, the arbiter will select the head entry of the read input 
FIFO if it is valid. The request which wins arbitration is flopped and goes through logic to be issued to the DDC. 
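
The two-level arbitration described above can be modeled behaviorally as in the sketch below. It is a software illustration under stated assumptions, not the RTL: the class and method names are hypothetical, a grant is assumed to count against the preference window only when the preferred type wins, the non-preferred type is assumed to win whenever the preferred type has no bidder, and the fallback to the read input FIFO is omitted.

```python
from collections import deque
from typing import Any, List, Optional, Tuple

NUM_BANKS = 8  # DDC manages 8 banks, so there are 8 read and 8 write bank FIFOs


class RoundRobin:
    """First-level arbiter: round-robin over the valid head entries of the bank FIFOs."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1  # bank 0 is examined first after reset

    def pick(self, valid: List[bool]) -> Optional[int]:
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if valid[idx]:
                return idx
        return None


class DdiArbiter:
    """Second level: prefer reads for read_grants grants, then writes for write_grants grants."""

    def __init__(self, read_grants: int, write_grants: int) -> None:
        if read_grants < 1 or write_grants < 1:
            raise ValueError("both preference windows must be at least 1 in this model")
        self.fifos = {kind: [deque() for _ in range(NUM_BANKS)] for kind in ("rd", "wr")}
        self.rr = {kind: RoundRobin(NUM_BANKS) for kind in ("rd", "wr")}
        self.quota = {"rd": read_grants, "wr": write_grants}
        self.preferred = "rd"
        self.remaining = read_grants

    def push(self, kind: str, bank: int, request: Any) -> None:
        self.fifos[kind][bank].append(request)

    def grant(self) -> Optional[Tuple[str, int, Any]]:
        """Select one request for this cycle, or None if nothing is bidding."""
        picks = {k: self.rr[k].pick([len(f) > 0 for f in self.fifos[k]]) for k in ("rd", "wr")}
        other = "wr" if self.preferred == "rd" else "rd"
        kind = self.preferred if picks[self.preferred] is not None else other
        bank = picks[kind]
        if bank is None:
            return None
        self.rr[kind].last = bank
        if kind == self.preferred:
            self.remaining -= 1
            if self.remaining == 0:
                self.preferred, self.remaining = other, self.quota[other]
        return kind, bank, self.fifos[kind][bank].popleft()
```

With a 2:1 read-to-write preference and both types bidding, the model issues two reads, then one write, and repeats.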

Refer to the DDC section for documentation and waveforms describing the interface between DDI and DDC for
issuing requests to DDC. 

When the DDC accepts a write request, the write request is pushed onto the Write Data Pending Fifo where 
it will remain until the DDC asks for the associated write data (there is no fixed timing between when the DDC
accepts a write request and when it will be ready to accept the write data). When DDC asks for the write data (which
is supplied by the data path logic) the entry is deallocated from the Write Data Pending Fifo so the TID of this 
completed write can be sent to the COH. This is the point where the COH can safely release the write from the 
Write Back Cam. The TID needs to be synchronized back over to the CCLK domain before sending it to the COH. 
This is done through the Write Complete Transfer Fifo. 

When the DDC accepts a read request, the read request is pushed onto the Read Return Pending Fifo, which
synchronizes from the DCLK to the CCLK domain. The head entry is deallocated (providing the TID) when the DDD
section signifies the return of read data. The Read TID is used to construct the CSW target vector. 


8.4.3 Read Shoot Down


The request path incorporates logic to allow reads to be shot down. This allows the COH to issue reads 
speculatively to improve performance and also to kill reads which would cause a RAW hazard due to a write in 
DDI which has not as yet completed. By the time the shoot down signal is received in DDI, the read may be in 
the forward path (not yet issued to DDC) or the return path (in the Read Return Pending Fifo). Shoot down 
commands are logged into a vector (m_RdShdVec_c4a[19:0]), where each entry corresponds to one of the 20 possible 
TIDs available for read usage. When a read request is issued from COH it clears the corresponding entry in the 
shoot down vector, and when a shootdown is received from the COH it sets the corresponding entry. Because the 
TID is used to execute the shootdown, DDI cannot accept another request with the same TID until the shootdown 
completion has been confirmed via the ddr_coh_RdShotDown_c2a / ddr_coh_DataTID_c2a signal set. 

In the forward path, reads that win arbitration for access to DDC are checked against a DCLK domain copy 
of the shoot down vector (m_RdShdVec_copy_d5a[19:0]) and not issued to DDC if the corresponding entry is set. 
Instead of entering the Read Return Pending Fifo, the TID of the “dropped read” is allocated into the Shootdown 
Forward Path Transfer Fifo which synchronizes over to the CCLK domain. The head entry will deallocate, cause 
the assertion of ddr_coh_RdShotDown_c2a, and drive the shotdown TID onto ddr_coh_DataTID_c2a (this is done 
during a cycle where ddr_coh_DataValid_c2a will not assert (DDI knows a cycle ahead of time before data will
return)). 

In the return path, when read data returns the head entry of the Read Return Pending Fifo provides the TID 
to index into m_RdShdVec_c4a[19:0]. If the corresponding bit is set, then ddr_coh_RdShotDown_c2a will assert while
ddr_coh_DataValid_c2a is forced low and the shotdown TID is driven onto ddr_coh_DataTID_c2a. 
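
The set and clear rules for the shoot down vector can be summarized with a small behavioral model. This sketch covers only the return-path check; the class and method names are hypothetical, and the clock domain crossing and the DCLK copy of the vector are not modeled.

```python
NUM_READ_TIDS = 20  # one vector entry per TID available for read usage


class ReadShootDownVector:
    """Software model of a 20-entry shoot down vector, one bit per read TID."""

    def __init__(self) -> None:
        self.bits = 0

    def _check(self, tid: int) -> None:
        if not 0 <= tid < NUM_READ_TIDS:
            raise ValueError("read TID out of range")

    def read_issued(self, tid: int) -> None:
        """A read request issued from COH clears the entry for its TID."""
        self._check(tid)
        self.bits &= ~(1 << tid)

    def shoot_down(self, tid: int) -> None:
        """A shoot down received from COH sets the entry for its TID."""
        self._check(tid)
        self.bits |= 1 << tid

    def is_shot_down(self, tid: int) -> bool:
        self._check(tid)
        return bool((self.bits >> tid) & 1)

    def read_returned(self, tid: int) -> str:
        """Name the signal that qualifies ddr_coh_DataTID_c2a when data for this TID returns."""
        return "ddr_coh_RdShotDown_c2a" if self.is_shot_down(tid) else "ddr_coh_DataValid_c2a"
```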


8.4.4 Data Path 


Write Data arrives at DDI piped into 4 consecutive 144-bit chunks (128b data + 16 ECC bits). The first
144 bits arrive coincident with the write valid signal, TID, HalfMask, and destination address. When the write
request arrives the address is checked to make sure it is not outside the range of the memory defined by the DDR
configuration registers. The write data is stored in a register file, indexed by the WrTID. The register file is written
in the CCLK domain when the request is issued from COH, and is read out on the DCLK domain when the DDC
requests the data for a write request which was previously accepted. The delay between when the register file is
written and the earliest time DDC can request the write data is guaranteed to be long enough to avoid a synchronization
violation on the register file. When data is read out of the register file it is sent to DDC in 4 consecutive 144-bit
chunks.

The details of the read datapath are discussed below in the DDD section and DDP unit descriptions. 


8.4.5 Requests to non-existent memory 


Requests to non-existent memory are accesses which have upper address bits set which are outside of the range
for the selected DRAM configuration. The CSR DdrxDdiMifCfg1MemAddrSize[2:0] is used to determine if a
request is to non-existent memory. Based on this CSR, the upper bits of the address presented to DDC are forced
low (this forces address aliasing). The memory requests will complete as normal using the aliased address (i.e. writes
to non-existent memory are software errors which will result in data corruption).

It is required that DdrxDdiMifCfg1MemAddrSize[2:0] be set correctly; otherwise a read to non-existent memory
could cause fatal errors in the read return logic by resulting in a read which does not get a response from memory
(i.e. it maps to a chip select for a non-existent rank). This would throw off the fifo pointers in the read return logic,
causing reads to return data that was meant to correspond to subsequent reads.


8.4.6 Powerdown 


The memory interface includes logic to issue power-down commands to memory if the interface is idle for a
user-controlled number of cycles. Using power-down reduces the power dissipation in the memory DIMMs. It is
expected that enabling power-down will have a minimal impact on performance, since wake up from powerdown 
is on the order of a few cycles. Any impact can be mitigated by increasing the number of idles required before 
power-down is entered. It may be possible for power-down to impact performance for some code patterns. 


8.4.7 Read Time-Out 


The DDR unit includes read time-out detection logic which is intended as a debug tool for improperly configured 
systems (for example if the settings of the DLL in the PHY are incorrectly programmed potentially causing the 
return of read data to be dropped). The read time-out logic can be used to indicate such a problem. It is not 
intended for use during normal system operation. It will not precisely indicate which particular read has hung,
and it may fire after allowing numerous returns of bad read data in a poorly configured system. The reason for this
is that reads return in order. Thus if a particular read is dropped, any subsequent read returning will be applied
to the wrong requester. Thus it is only after reads have stopped returning that we can be sure that there is a
problem when DDR still has one or more reads waiting for data. 

In general, we can bound the amount of time that a read should be outstanding once it has been issued to DDC 
(the Northwest logic memory controller). Since the read time-out logic is never expected to be needed during normal
operation, the count was chosen to be much larger than necessary to be conservative. The count used is 4096 clock
cycles (which is probably 8 times the real worst case). 

Each of the 20 TIDs has an associated counter which can count to 4096 dclk cycles. When DdrxDdiRdTimeOut_Enable
is set, these counters are enabled to start counting when a read of the corresponding TID is issued
from DDI to DDC. DdrxDdiRdTimeOut_Enable will be set to a 1 if a read hang is detected for any read TID (this
is sticky and will remain set until it is cleared via the SCB bus; note it is W1C, “write one to clear”, so software
must write a 1 to the corresponding bit position in order to clear it).

If DdrxDdiRdTimeOut_AutoCompletion is set, then if a read is determined to have hung, the DDR unit will 
return a fake completion message (assertion of ddr_coh_DataValid_c2a or ddr_coh_RdShotDown_c2a). The DDR
unit will return whatever data values are in its read data path flops. Note that if the read data was corrupted it 
may result in an uncorrectable ECC error pattern on the returning data. Read Time-Out AutoCompletion is a 
feature which is intended to be used primarily for the calibration of the read path DLLs and for debugging, however 
it can be enabled during normal operation if software finds it useful. 


8.4.8 Registers and Definitions 


This subsection defines the CSR registers, while the next subsection creates the two instances. The CSRs live
in the DdrDdiCsr sub-module of DDI, which runs on the DCLK. The CSRs are written and read via the ICE9
Serial Configuration Bus.

The values of the registers R_DdrxDdcMemCfg1-5, R_DdrxDdcDIMMODT, R_DdrxDdpODT, and R_DdrxDIMMSize
may only change prior to the de-assertion of R_DdrxDdcDdpSoftReset. More specific information is located in the
“Reset and Initialization” section of this chapter.

The values of the registers R_DdrxDdiMifCfg1-2 can be changed at any time.

The “SPD Byte #” column in the tables below is provided as a hint as to what information may need to be read
from the DIMMs’ SPD in order to figure out what value to set for the corresponding CSR field. Note that many
of the parameters accessed from SPD are in time units while many of the corresponding CSRs are in units of
DCLK cycles.


8.4.8.1 R_DdrxDdcDdpSoftReset - Soft Reset for DDC and DDP 
Register 
R_DdrxDdcDdpSoftReset 


Address 


0x0_0000_0000 (plus base address) 


InitDimm ICE9B+ | 1 -> 0 transition tells the controller to re-issue the initialization
sequence to the DIMM. The controller will always issue the initialization sequence
after SoftResetDDC is deasserted (goes low) regardless of the state of
InitDimm. InitDimm can be left low if run-time re-initialization is not required.

SoftResetDDP ICE9B+ | Used as the reset signal for DDP.
Separating this from the reset to DDC allows DDP to
wake up first and calibrate its IO driver output impedance
before we wake up DDC and have it start the
JEDEC DRAM init sequence.

SoftResetDDC RW 1 ICE9B+ | Used as the reset signal for DDC.
Can only be deasserted after setting the correct CSR values
in R_DdrxDdcMemCfg1-5, R_DdrxDdcDIMMODT,
R_DdrxDdpODT, and R_DdrxDIMMSize. The de-assertion
(transition from HIGH to LOW) causes the
DDR2-SDRAM controller to issue the JEDEC standard
initialization sequence to the SDRAM devices. (Note the
Type “L” is an indication that this is intended to normally
be the last CSR written.) Overlaps SoftReset.

SoftReset RW 1 ICE9A | Used as the reset signal for DDC and DDP.
Can only be deasserted after setting the correct CSR values
in R_DdrxDdcMemCfg1-5, R_DdrxDdcDIMMODT,
R_DdrxDdpODT, and R_DdrxDIMMSize. The de-assertion
(transition from HIGH to LOW) causes the
DDR2-SDRAM controller to issue the JEDEC standard
initialization sequence to the SDRAM devices. (Note the
Type “L” is an indication that this is intended to normally
be the last CSR written.)


8.4.8.2 R_DdrxDdcMemCfg1 - Memory Controller Configuration Register 1 


Register 
R_DdrxDdcMemCfg1


Address


0x0_0000_0004 (plus base address) 


(Valid Values) | (SPD Byte #)


31 | PchPowerDown | RW 0-1 *** This feature is NOT supported. It
is a requirement that software write a “0”
to PchPowerDown before bringing the
DDR interface out of reset. ***
[recy 26 Peet RW Active to precharge (tRAS), 

25:23 | RCD RW Active to read or write delay (tRCD), 
22:20 | RRD RW 2-4 28 Active bank a to active bank b (tRRD), 

19:17 | RP RW Precharge command period (tRP), 
Active to active/auto-refresh period (tRC), specified in DCLK cycles
Auto-refresh to active/auto-refresh period (tRFC), specified in DCLK cycles
Read to precharge delay (tRTP), specified in DCLK cycles


8.4.8.3 R_DdrxDdcMemCfg2 - Memory Controller Configuration Register 2 


Register 


R_DdrxDdcMemCfg2 


Address 


0x0_0000_0008 (plus base address)


(Valid Values) | (SPD Byte #)


31:29 | MRD 1-7 load mode register cmd to active or refresh, 
specified 
in DCLK cycles. 
2 is valid minimum value for tMRD for a 
wide range of DDR2 parts. 
oc re eee [reseed SSCS~S 
a 18 Peo al (9, 23,25) eee latency, specified in DCLK cycles 
a ee or sats config file (Note: CAS latency of 3 is NOT supported) 
inet 15 a on recovery time (tWR), specified in 
el 12 | WTR Write to read cmd delay (tWTR), specified 


11: basa AL RW Additive latency, specified in DCLK cycles 
Note that non-zero AL values may improve
DDR2 bus utilization and hence performance,
especially for random access patterns
and/or if reads and writes are issued
with auto-precharge.


Four bank activate period (tFAW), specified 
in DCLK cycles 

This defaults to an acceptable value. Other 
choices are provided below. 

From JEDEC Spec 79-2B 

DDR2 400/800 - 35ns => 14 cycles 

DDR2 333/667 - 37.5ns => 13 cycles 
DDR2 266/533 - 50ns => 14 cycles 


Reserved 
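
The tFAW cycle counts listed above follow from dividing the JEDEC time by the DCLK period and rounding up. The sketch below reproduces them; it assumes nominal clock periods of 2.5 ns, 3.0 ns and 3.75 ns for the 400, 333 and 266 MHz settings, and the helper name is hypothetical. Integer picoseconds are used to avoid floating point rounding.

```python
def time_to_dclk_cycles(time_ps: int, tck_ps: int) -> int:
    """Convert a minimum time into DCLK cycles, rounding up so the minimum is always met."""
    return -(-time_ps // tck_ps)  # ceiling division on integers


# (tFAW in ps, assumed nominal tCK in ps) for each supported speed
TFAW_CYCLES = {
    "DDR2 400/800": time_to_dclk_cycles(35_000, 2_500),
    "DDR2 333/667": time_to_dclk_cycles(37_500, 3_000),
    "DDR2 266/533": time_to_dclk_cycles(50_000, 3_750),
}
```

The same conversion applies to the other timing fields that the SPD reports in time units and the CSRs expect in DCLK cycles.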


8.4.8.4 R_DdrxDdcMemCfg3 - Memory Controller Configuration Register 3 
Register 
R_DdrxDdcMemCfg3 


Address 
0x0_0000_000c (plus base address) 


(Valid Values) | SPD Byte 
Bite. = a Se re 


Bankbits | Number of bits in the bank address (encoded). Values are mapped as follows:
0 - 2 bank bits (i.e. 4 bank chips)
1 - 3 bank bits (i.e. 8 bank chips)

27:25 | Rowbits RW Number of bits in the row address (encoded)
3 - 14 row bits
4 - 15 row bits
5 - 16 row bits


24:8 | Delay RW 10-131071 Reset to SDRAM init delay, specified in
DCLK cycles.
Valid values: 10 - 131071
At 400MHz DDR, Delay = 80000 * 2.5ns =
200us
(JEDEC requires a minimum of 200us)


Reserved 
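
The Delay value can be derived from the DCLK period as in the following sketch, which also shows the field being placed in bits 24:8 of the register value. The helper names are hypothetical, and the clock period is passed in picoseconds as an assumption of this example.

```python
DELAY_LSB = 8      # Delay occupies bits 24:8
DELAY_WIDTH = 17
DELAY_MIN = 10
DELAY_MAX = (1 << DELAY_WIDTH) - 1  # 131071


def delay_cycles(tck_ps: int, stable_clock_us: int = 200) -> int:
    """Number of DCLK cycles covering the required stable-clock time, rounded up."""
    cycles = -(-(stable_clock_us * 1_000_000) // tck_ps)
    if not DELAY_MIN <= cycles <= DELAY_MAX:
        raise ValueError("Delay does not fit the valid range 10-131071")
    return cycles


def insert_delay(reg_value: int, cycles: int) -> int:
    """Return reg_value with the Delay field replaced by cycles."""
    mask = DELAY_MAX << DELAY_LSB
    return (reg_value & ~mask) | (cycles << DELAY_LSB)
```

If part of the 200us has already elapsed while other startup steps run, a smaller stable_clock_us can be passed to shorten the programmed delay.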


8.4.8.5 R_DdrxDdcMemCfg4 - Memory Controller Configuration Register 4
Register 
R_DdrxDdcMemCfg4 


Address 
0x0_0000_0010 (plus base address) 


(Valid Values) | (SPD Byte #)


31:16 | REFI 10-65535 Period between auto-refresh commands issued
by the controller, specified in DCLK cycles.
ref = auto refresh interval / tCK
tREFI should be set to 7.8us.
400MHz => 3125
333MHz => 2604
267MHz => 2083
Note: JEDEC 79-2B requires setting tREFI
to 3.9us if 85 degrees C < tCASE <= 95 degrees C.
Preliminary studies show that tCASE
is expected to be below 70 degrees in our system.


15 | Regdimm | RW | 0 | 0-1 | Set when using registered/buffered DIMM


14 | DS RW 0-1 Drive strength setting programmed into
Extended Mode Register bit 1. Values are mapped to
EMR as follows (refer to the DDR2 SDRAM device
data-sheet for a description of drive strength
settings):
0 - EMR[1] = 0
1 - EMR[1] = 1 (SPD Byte #22 reports whether this
is supported)

13:12 | Rtt ODT effective resistance Rtt. DDR2 On-Die
Termination effective resistance setting
programmed into Extended Mode Register
bits 2 and 6. Values are mapped to EMR as follows:
0 - EMR[6] = 0, EMR[2] = 0 (Rtt disabled)
1 - EMR[6] = 0, EMR[2] = 1 (75 ohms)
2 - EMR[6] = 1, EMR[2] = 0 (150 ohms)
3 - EMR[6] = 1, EMR[2] = 1 (50 ohms; not supported on the slowest parts)
SPD Byte #22 reports whether 50 ohms is
supported. The 150 ohm setting may be appropriate
for interfacing to 1 and 2 rank DDR2 DIMMs
running at 333/667 or 400/800.

off ¢ ion. is si 


passed to bit E12 of the Extended Mode
Register during initialization. Typically set
to ’0’ to enable data and strobe outputs from
the SDRAM devices. Can be set to ’1’ for IDD
characterization of read current.


Reserved 
8.4.8.6 R_DdrxDdcMemCfg5 - Memory Controller Configuration Register 5
Register 
R_DdrxDdcMemCfg5 


Address 
0x0_0000_0014 (plus base address) 


31:16 | emr2 RW Value programmed into DIMM’s Extended Mode Regis- 
ter 2 during initialization. Most DDR2 SDRAM devices 
specify all of these bits as reserved (must be set to 0). 


15:0 | emr3 RW Value programmed into DIMM’s Extended Mode Register 
3 during initialization. 
Most DDR2 SDRAM devices specify all of these bits as 
reserved (must be set to 0). 


8.4.8.7 R_DdrxDdcMemCfg6 - Memory Controller Configuration Register 6 
Register 
R_DdrxDdcMemCfg6 


Address 
0x0_0000_0018 (plus base address) 


(Valid Values)
Reserved


Reserved

Causes DQ and DQS to be driven during idle periods
(when no reads or writes are occurring). If this
bit is set, the bus will be driven during idle periods
as follows:
- After a write, the bus will remain driven. DQ lines
will be driven with the value of the last data phase.
- After a read, the bus will be driven # clocks after the
end of the read postamble, where # is selected using
ReadToIdleDriveDelay. The bus will be driven to a
value of 72'haa_aaaa_aaaa_aaaa_aaaa.


ReadToIdleDriveDelay Delay to DQSP, DQSN, and DQ output enable
switch-on after a read command relative to end of
read postamble.
0x0 : -1.0 clocks
0x1 : 0 clocks
0x2 : 1.0 clocks
0x3 : 2.5 clocks

14 | LookaheadPch RW 1 0-1 Look ahead precharge enable. When enabled the 
controller will look ahead into the command 
queue and analyze the queued requests and 
perform precharge operations as soon as possible 
in order to maximize bandwidth efficiency. 

0 - disable 
1 - enable 
13 | LookaheadAct Look ahead activate enable. When enabled, the
controller will look ahead into the command
queue and analyze the queued requests and
perform activate operations as soon as possible
in order to maximize bandwidth efficiency.
0 - disable
1 - enable


LookaheadApch 0-1 Look ahead auto-precharge enable. When enabled,
the controller will look ahead into the command
queue and analyze the queued requests and
apply auto-precharge to the current
read or write operation in order to maximize
bandwidth efficiency.
0 - disable
1 - enable


OdtAdvTurnOn Advance ODT turn-on by one clock.


10 | OdtDelayTurnOff Delay ODT turn-off by one clock.


TwoTMode Two cycle timing (2T) enable. When enabled, the
controller extends the timing of the SDRAM control
signals (ras, cas, and we) to be two clocks in duration.
1 - enable
0 - disable


TwoTModeSelCycle Two cycle timing cycle select. Controls in which phase
of the two clock cycle command period cs_n is asserted.
0 - cs_n asserted during the first cycle
1 - cs_n asserted during the second cycle


ReadToWrite 1, 2, 3 Read to write delay (valid values: 1, 2, 3)


Minimum delay from write to write (different
ranks).
NOTE: zero is a legal choice ONLY if
R_DdrxDdcDIMMODT_OdtWrMapCs* = 0000.
(Setting this to zero can cause ODT problems: the
ODT spec requires turn-on 3 cycles before the
data and turn-off 2 cycles before the data, so if
the data to different ranks were back to back,
switching to the ODT for the second write would cause
the first to switch prematurely.)


ReadToRead Minimum delay from read to read (different ranks).
NOTE: zero is a legal choice ONLY if
R_DdrxDdcDIMMODT_OdtRdMapCs* = 0000.
(Setting this to zero can cause ODT problems: the
ODT spec requires turn-on 3 cycles before the
data and turn-off 2 cycles before the data, so if
the data from different ranks were back to back,
switching to the ODT for the second read would cause the
first to switch prematurely.)
NOTE: a value of zero also has the potential
to result in output drive contention between
ranks.
8.4.8.8 R_DdrxDdcMemCfg7 - Memory Controller Configuration Register 7
Register 
R_DdrxDdcMemCfg7 


Address 


0x0_0000_001c (plus base address) 
(Valid Values)
Reserved


Disables automatic initialization handled by the
controller


Tai RW Mode Register towmitete——S—* 
te Sats AW || =| Comer toate aatar —— 
1 | InitPrechargeAll RW 0 0-1 Issue precharge-all command
0 | InitRefresh RW 0 0-1 Issue refresh command


8.4.8.9 R_DdrxDdcDIMMODT - Memory Controller ODT Selection Matrix Configuration 


The defaults for R_DdrxDdcDIMMODT are expected to be appropriate for the target single and dual rank
configurations of one DIMM slot based on reviewing preliminary termination matrix recommendations presented 
by Samsung for 667 data rate operation and Micron for 667 and 800 data rates. We plan to follow the industry 
recommendations for single-DIMM-slot designs, which call for ODT on the active DIMM rank only, during writes, 
and ODT on the controller only, during reads. 
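
The single-DIMM-slot policy described above (ODT on the active rank only during writes, no DRAM ODT during reads) can be sketched as a register value. This is an illustration only: the helper names are hypothetical, and the field positions assume the eight 4-bit map fields of this register run from OdtRdMapCs0 at bits 31:28 down to OdtWrMapCs3 at bits 3:0.

```python
# Assumed field positions: OdtRdMapCs0..3 at 31:28, 27:24, 23:20, 19:16 and
# OdtWrMapCs0..3 at 15:12, 11:8, 7:4, 3:0.
RD_SHIFTS = (28, 24, 20, 16)
WR_SHIFTS = (12, 8, 4, 0)

def pack_dimm_odt(rd_maps, wr_maps):
    """Pack four read maps and four write maps (4 bits each) into 32 bits."""
    value = 0
    for cs in range(4):
        for m in (rd_maps[cs], wr_maps[cs]):
            if not 0 <= m <= 0xF:
                raise ValueError("ODT map must fit in 4 bits")
        value |= rd_maps[cs] << RD_SHIFTS[cs]
        value |= wr_maps[cs] << WR_SHIFTS[cs]
    return value

def odt_outputs(mask):
    """List the odt[n] outputs enabled by a 4-bit map."""
    return [n for n in range(4) if (mask >> n) & 1]

def single_slot_policy(num_ranks):
    """Writes terminate on the active rank only; reads use no DRAM ODT."""
    rd = [0, 0, 0, 0]
    wr = [(1 << cs) if cs < num_ranks else 0 for cs in range(4)]
    return pack_dimm_odt(rd, wr)

print(hex(single_slot_policy(2)), odt_outputs(0b1110))
```

For a dual rank DIMM this gives write maps of 1 and 2 for chip selects 0 and 1, which matches the reset values shown for OdtWrMapCs0 and OdtWrMapCs1.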

Register 


R_DdrxDdcDIMMODT 


Address 
0x0_0000_0020 (plus base address) 


31:28 | OdtRdMapCs0 RW Selects which DRAM ODT outputs are enabled when reading
from chip select 0.
ex: odt_rd_map_cs0=4’b1110 will enable odt[1], odt[2], and
odt[3] during a read from memory devices on chip select 0.


27:24 | OdtRdMapCs1 RW Selects which DRAM ODT outputs are enabled when reading
from chip select 1.


23:20 | OdtRdMapCs2 RW Selects which DRAM ODT outputs are enabled when reading
from chip select 2.
19:16 | OdtRdMapCs3 Selects which DRAM ODT outputs are enabled when reading
from chip select 3.
15:12 | OdtWrMapCs0 RW 1 Selects which DRAM ODT outputs are enabled when writing
to chip select 0.
11:8 | OdtWrMapCs1 RW 2 Selects which DRAM ODT outputs are enabled when writing
to chip select 1.
7:4 | OdtWrMapCs2 RW Selects which DRAM ODT outputs are enabled when writing
to chip select 2.
3:0 | OdtWrMapCs3 Selects which DRAM ODT outputs are enabled when writing
to chip select 3.


8.4.8.10 R_DdrxDdpODT - On-Die-Termination resistance value on ICE9 DDR2-I/O PADs during reads


Register 
R_DdrxDdpODT 
Address
0x0_0000_0024 (plus base address)


31:30 | On-Die-Termination value used in the DDR PHY IO
cells. Maps to the values driven into the {TERM150,
TERM300} pins of the ARM IO cell.
00 - Rx Mode, ODT disabled
01 - Rx Mode, 150 ohm calibrated ODT
10 - UNDEFINED IN ARM SPEC
11 - Rx Mode, 75 ohm calibrated ODT
The 150 Ohm setting is expected to be sufficient. However,
it may be necessary to use the 75 Ohm setting for 400/800 systems.


29:0 | Reserved


8.4.8.11 R_DdrxDIMMSize - Size of the DIMM this DDR unit instance is interfacing with.
Register 


R_DdrxDIMMSize 


Attributes 


-kernel 


Address 


0x0_0000_0028 (plus base address) 
Total memory connected to this DDR interface (half of the
total main memory space per ICE9): DIMM Rank Density
* Number of Ranks.
Used to filter out requests to non-existent memory.
Valid values 0 - 4
0 - 1GB
1 - 2GB
2 - 4GB
3 - 8GB
4 - 16GB
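
As a small illustration of the encoding above, the field value can be derived from the rank density and rank count. `dimm_size_code` is a hypothetical helper, not part of any SiCortex software.

```python
# Encoding from the table above: field value -> total size in GB.
SIZE_CODES = {1: 0, 2: 1, 4: 2, 8: 3, 16: 4}

def dimm_size_code(rank_density_gb, num_ranks):
    """Return the DIMM size field value for rank density * number of ranks."""
    total_gb = rank_density_gb * num_ranks
    if total_gb not in SIZE_CODES:
        raise ValueError("unsupported total size: %r GB" % total_gb)
    return SIZE_CODES[total_gb]

print(dimm_size_code(2, 2))  # dual rank DIMM with 2GB ranks -> 4GB total
```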


8.4.8.12 R_DdrxDdiMifCfg1 - Memory Interface Configuration Register 1 


Register 


R_DdrxDdiMifCfg1 


Address 


0x0_0000_002c (plus base address) 
31:9 | Reserved


ArbPrefWheel 0x0F | Each bit set represents an additional 1 out of 10 cycles
where reads have arbitration preference over writes. This
allows for performance tuning by allowing more or fewer reads
to pass independent write requests in DDI.
Note:
1. ArbPrefWheel should always be programmed with contiguous
bits set (to minimize the DDR bus turnaround time
penalty of switching from reads to writes or vice versa).
More specifically, ArbPrefWheel should be programmed
to one of the following values:
00000000
00000001
00000011
00000111
00001111
00011111
00111111
01111111
11111111
2. The arbitration preference for 2 out of 10 cycles is not
user controllable, but is dedicated, 1 for reads and 1 for writes,
to prevent starvation if a user sets (or clears) all the bits
of ArbPrefWheel.
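
The legal ArbPrefWheel values listed above are exactly the 8-bit values with a contiguous run of ones starting at bit 0. A minimal sketch, with hypothetical helper names, and assuming the one fixed read slot from note 2 is counted on top of the programmed bits:

```python
def arb_pref_wheel(extra_read_slots):
    """Field value giving reads preference in N additional cycles (N = 0..8)."""
    if not 0 <= extra_read_slots <= 8:
        raise ValueError("ArbPrefWheel has 8 programmable bits")
    return (1 << extra_read_slots) - 1

def is_legal_wheel(value):
    """True if the set bits are contiguous from bit 0 (one of the listed values)."""
    return 0 <= value <= 0xFF and (value & (value + 1)) == 0

def read_preferred_cycles(value):
    """Cycles out of 10 where reads are preferred, including the fixed read slot."""
    return bin(value).count("1") + 1

print(hex(arb_pref_wheel(4)), read_preferred_cycles(0x0F))
```

With the reset value 0x0F this works out to reads being preferred in 5 of every 10 cycles.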


AutoPch The auto-precharge option is useful where the access patterns
tend to be random (as seen at the DDR2 interface).
With random sequences, banks are rarely left open with
the exact row required by a subsequent request. If
auto-precharge was not used for the previous access to a bank,
subsequent accesses to that bank first require the bank to
be closed (precharged), causing a delay.
0 - Requests issued as read / write without auto-precharge
1 - Requests issued as read / write with auto-precharge


8.4.8.13 R_DdrxDdiMifCfg2 - Memory Interface Configuration Register 2 
Register 


R_DdrxDdiMifCfg2 


Address 


0x0_0000_0030 (plus base address) 


May 14, 2014 ATT Rev 51328 


SiCortex Confidential CHAPTER 8. MEMORY CONTROLLER 


PwrDnEnable 0 - DDR2 is never issued the power-down command
1 - DDR2 is issued the power-down command if no
read or write requests are sent to the memory interface for
a period of time determined by the PwrDnCount setting.


17:0 | PwrDnCount RW Number of ICE9 core clock (cclk) idle cycles before a
power-down command is issued to memory. This is required
to be set to a value larger than (Twait = 2 *
R_DdrxDdcMemCfg1_RFC) in dclks.

Examples for DIMMs configured with 1Gb devices:
cclk/dclk Twait
250/400 - Twait = 102 dclks, PwrDnCount >= 64 cclks
250/333 - Twait = 86 dclks, PwrDnCount >= 54 cclks
250/267 - Twait = 68 dclks, PwrDnCount >= 43 cclks
Note: the R_DdrxDdcMemCfg1_RFC values used in these
calculations are from "Table 39 - Refresh parameters by
device density" of JESD79-2B (JEDEC Standard - DDR2 SDRAM
Specification).
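
The Twait column above can be reproduced from the refresh cycle time. This is a sketch under two assumptions that the text does not state outright: that the RFC field is tRFC rounded up to whole dclks, and that tRFC is 127.5ns for 1Gb devices (the JESD79-2B value). The helper names are hypothetical.

```python
import math
from fractions import Fraction

# Assumption: tRFC for 1Gb DDR2 devices is 127.5 ns (JESD79-2B refresh table).
T_RFC_1GB_NS = Fraction(255, 2)

def rfc_dclks(dclk_mhz, t_rfc_ns=T_RFC_1GB_NS):
    """tRFC expressed in whole dclk cycles, rounded up."""
    return math.ceil(Fraction(t_rfc_ns) * Fraction(dclk_mhz) / 1000)

def twait_dclks(dclk_mhz, t_rfc_ns=T_RFC_1GB_NS):
    """Twait = 2 * RFC, in dclks."""
    return 2 * rfc_dclks(dclk_mhz, t_rfc_ns)

for mhz in (Fraction(400), Fraction(1000, 3), Fraction(800, 3)):
    print(float(mhz), twait_dclks(mhz))
```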


8.4.8.14 R_DdrxPhyCfg1 - PHY Interface Configuration Register 1


Register 


R_DdrxPhyCfg1 


Address 


0x0_0000_0034 (plus base address) 
(Valid Values)
31:12 | Reserved


Reserved
11:9 | DqsOeOn DQS output enable switch-on time relative to
start of write preamble.
0x0: -1.5 clocks
0x1: -1.0 clocks
0x2: -0.5 clocks
0x3: 0 clocks


8:6 | DqsOeOff DQS output enable switch-off time relative to
end of write postamble.
0x0: -1.5 clocks
0x1: -1.0 clocks
0x2: -0.5 clocks
0x3: 0 clocks
0x4: 0.5 clocks
0x5: 1.0 clocks
0x6: 1.5 clocks
0x7: 2.0 clocks


5:3 | DqOeOn RW 0-5 2 DQ output enable switch-on time relative to
start of write preamble.
0x0: -1.25 clocks
0x1: -0.75 clocks
0x2: -0.25 clocks
0x3: 0.25 clocks


2:0 | DqOeOff RW 0-7 2 DQ output enable switch-off time relative to
end of write postamble.
0x0: -1.25 clocks
0x1: -0.75 clocks
0x2: -0.25 clocks
0x3: 0.25 clocks
0x4: 0.75 clocks
0x5: 1.25 clocks
0x6: 1.75 clocks
0x7: 2.25 clocks
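
All four output-enable fields above step in half-clock increments from a fixed base, so the tables can be expressed as a conversion. This is only a sketch: the helper names are hypothetical, the DQS field names are assumed to be DqsOeOn and DqsOeOff, and only the codes actually listed in the tables are accepted.

```python
from fractions import Fraction

STEP = Fraction(1, 2)  # every code is half a clock later than the previous one

# field -> (offset in clocks for code 0x0, highest code listed in the table)
OE_FIELDS = {
    "DqsOeOn": (Fraction(-3, 2), 3),
    "DqsOeOff": (Fraction(-3, 2), 7),
    "DqOeOn": (Fraction(-5, 4), 3),
    "DqOeOff": (Fraction(-5, 4), 7),
}

def code_to_clocks(field, code):
    """Switch time in clocks for a field code."""
    base, max_code = OE_FIELDS[field]
    if not 0 <= code <= max_code:
        raise ValueError("code not listed for %s: %r" % (field, code))
    return float(base + STEP * code)

def clocks_to_code(field, clocks):
    """Field code for a switch time; raises if the time is not in the table."""
    base, max_code = OE_FIELDS[field]
    steps = (Fraction(clocks) - base) / STEP
    if steps.denominator != 1 or not 0 <= steps <= max_code:
        raise ValueError("no code for %r clocks in %s" % (clocks, field))
    return int(steps)

print(code_to_clocks("DqOeOff", 2), clocks_to_code("DqsOeOff", 2.0))
```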


8.4.8.15 R_DdrxPhyCfg2 - PHY Interface Configuration Register 2 


Register 
R_DdrxPhyCfg2 


Address 
0x0_0000_0038 (plus base address) 


(Valid Values) 
Reserved
11:9 | AsicDqsOdtOn RW 0-5 2 Note there are two changes going from ICE9A
to ICE9B:
First - Bugzilla 2401 was fixed.
Secondly - the range of adjustability was
changed based on feedback from debug lab
bringup studies on ICE9A parts.
DQS resistor output enable (ASIC side ODT)
and pad input enable (IE-to-Y) switch-on time
relative to start of read preamble.
ICE9A RANGE:
0x0: -2.5 clocks (Not supported if AsicDqsOdtOff
is set to 0x6 or 0x7 (Bugzilla 2401))
0x1: -2.0 clocks (Not supported if AsicDqsOdtOff
is set to 0x6 or 0x7 (Bugzilla 2401))
0x2: -1.5 clocks
0x3: -1.0 clocks
0x4: -0.5 clocks
0x5: 0 clocks
ICE9B+ RANGE:
0x0: -1.5 clocks
0x1: -1.0 clocks
0x2: -0.5 clocks
0x3: 0 clocks
0x4: 0.5 clocks
0x5: 1.0 clocks
0x6: 1.5 clocks
0x7: 2.0 clocks
Note: The ARM SSTL18 output buffer contains
an AND gate which will disable the output
enable when the resistor output enable is
switched on.
8:6 | AsicDqsOdtOff RW 0-7 3 Note there are two changes going from ICE9A
to ICE9B:
First - Bugzilla 2401 was fixed.
Secondly - the range of adjustability was
changed based on feedback from debug lab
bringup studies on ICE9A parts.
DQS resistor output enable (ASIC side ODT)
and pad input enable (IE-to-Y) switch-off time
relative to the end of read postamble.
ICE9A RANGE:
0x0: -1.5 clocks
0x1: -1.0 clocks
0x2: -0.5 clocks
0x3: 0 clocks
0x4: 0.5 clocks
0x5: 1.0 clocks
0x6: 1.5 clocks (Not supported if AsicDqsOdtOn
is set to 0x0 or 0x1 (Bugzilla 2401))
0x7: 2.0 clocks (Not supported if AsicDqsOdtOn
is set to 0x0 or 0x1 (Bugzilla 2401))
ICE9B+ RANGE:
0x0: -0.5 clocks
0x1: 0 clocks
0x2: 0.5 clocks
0x3: 1.0 clocks
0x4: 1.5 clocks
0x5: 2.0 clocks
0x6: 2.5 clocks
0x7: 3.0 clocks
Note: The output enable of the ARM SSTL18
I/O buffer will be disabled as long as the
resistor output enable (ROE) pin is asserted.
Care must be taken to ensure that
longer ROE switch-off times do not interfere
with subsequent writes. The timing
of subsequent writes can be controlled using
R_DdrxDdcMemCfg6_ReadToWrite
5:3 | AsicDqOdtOn RW 0-5 1 Note there are two changes going from ICE9A
to ICE9B:
First - Bugzilla 2401 was fixed.
Secondly - the range of adjustability was
changed based on feedback from debug lab
bringup studies on ICE9A parts.
DQ resistor output enable (ASIC side ODT)
and pad input enable (IE-to-Y) switch-on time
relative to start of read preamble.
ICE9A RANGE:
0x0: -2.5 clocks (Not supported if AsicDqsOdtOff
is set to 0x6 or 0x7 (Bugzilla 2401))
0x1: -2.0 clocks (Not supported if AsicDqsOdtOff
is set to 0x6 or 0x7 (Bugzilla 2401))
0x2: -1.5 clocks
0x3: -1.0 clocks
0x4: -0.5 clocks
0x5: 0 clocks
ICE9B+ RANGE:
0x0: -1.5 clocks
0x1: -1.0 clocks
0x2: -0.5 clocks
0x3: 0 clocks
0x4: 0.5 clocks
0x5: 1.0 clocks
0x6: 1.5 clocks
0x7: 2.0 clocks
Note: The ARM SSTL18 output buffer contains
an AND gate which will disable the output
enable when the resistor output enable is
switched on.
2:0 | AsicDqOdtOff RW 0-7 3 Note there are two changes going from ICE9A
to ICE9B:
First - Bugzilla 2401 was fixed.
Secondly - the range of adjustability was
changed based on feedback from debug lab
bringup studies on ICE9A parts.
DQ resistor output enable (ASIC side ODT)
and pad input enable (IE-to-Y) switch-off time
relative to the end of read postamble.
ICE9A RANGE:
0x0: -1.5 clocks
0x1: -1.0 clocks
0x2: -0.5 clocks
0x3: 0 clocks
0x4: 0.5 clocks
0x5: 1.0 clocks
0x6: 1.5 clocks (Not supported if AsicDqsOdtOn
is set to 0x0 or 0x1 (Bugzilla 2401))
0x7: 2.0 clocks (Not supported if AsicDqsOdtOn
is set to 0x0 or 0x1 (Bugzilla 2401))
ICE9B+ RANGE:
0x0: -0.5 clocks
0x1: 0 clocks
0x2: 0.5 clocks
0x3: 1.0 clocks
0x4: 1.5 clocks
0x5: 2.0 clocks
0x6: 2.5 clocks
0x7: 3.0 clocks
Note: The output enable of the ARM SSTL18
I/O buffer will be disabled as long as the
resistor output enable (ROE) pin is asserted.
Care must be taken to ensure that
longer ROE switch-off times do not interfere
with subsequent writes. The timing
of subsequent writes can be controlled using
R_DdrxDdcMemCfg6_ReadToWrite
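
The ICE9A and ICE9B+ ranges above differ only in their starting offset, and the Bugzilla 2401 restriction applies to one corner of the ICE9A range. A minimal sketch of both, with hypothetical helper names and the revision passed as a plain string:

```python
# Offset in clocks for code 0x0 and the highest legal code, per chip revision.
ODT_ON = {"ICE9A": (-2.5, 5), "ICE9B": (-1.5, 7)}
ODT_OFF = {"ICE9A": (-1.5, 7), "ICE9B": (-0.5, 7)}

def _clocks(table, code, rev):
    base, max_code = table[rev]
    if not 0 <= code <= max_code:
        raise ValueError("code %r is out of range for %s" % (code, rev))
    return base + 0.5 * code  # half-clock steps

def odt_on_clocks(code, rev):
    """Switch-on time relative to the start of the read preamble."""
    return _clocks(ODT_ON, code, rev)

def odt_off_clocks(code, rev):
    """Switch-off time relative to the end of the read postamble."""
    return _clocks(ODT_OFF, code, rev)

def ice9a_pair_supported(on_code, off_code):
    """Bugzilla 2401: on-codes 0x0/0x1 cannot be combined with off-codes 0x6/0x7."""
    return not (on_code in (0, 1) and off_code in (6, 7))

print(odt_on_clocks(2, "ICE9A"), odt_off_clocks(3, "ICE9B"))
```

On ICE9B and later parts the restriction no longer applies, so only the range check is needed there.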


8.4.8.16 R_DdrxPhyCfg3 - PHY Interface Configuration Register 3 


Register 


R_DdrxPhyCfg3 


Address 


0x0_0000_003c (plus base address) 
(Valid Values)
Reserved


DqsPreambleEnnOn Read preamble enable switch-on time relative
to start of read preamble.
0x0: -0.5 clocks
0x1: 0 clocks
0x2: 0.5 clocks
0x3: 1.0 clocks
0x4: 1.5 clocks
0x5: 2.0 clocks


10:8 | DqsPreambleEnnOff Read preamble enable switch-off time relative
to the third edge of the read DQS.
0x0: -1.0 clocks
0x1: -0.5 clocks
0x2: 0 clocks
0x3: 0.5 clocks
0x4: 1.0 clocks
0x5: 1.5 clocks
0x6: 2.0 clocks
0x7: 2.5 clocks


8.4.8.17 R_DdrxDdpDLLLane0 - PHY Read Lane 0 DLL Configuration Register 
Register 


R_DdrxDdpDLLLane0 


Address 


0x0_0000_0040 (plus base address) 
31:24 | Reserved


23:16 | MasterAdj RW 186 | Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).


15:8 | Slave0Adj RW 1 Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj RW 12 Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).
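
Each per-lane DLL register uses the same three 8-bit fields, so a lane value can be packed as shown below. This is a sketch only: the helper names are hypothetical, and the default arguments are the reset values from the table above (186, 1 and 12).

```python
def pack_dll_lane(master_adj=186, slave0_adj=1, slave1_adj=12):
    """Pack MasterAdj (23:16), Slave0Adj (15:8) and Slave1Adj (7:0)."""
    for name, v in (("MasterAdj", master_adj),
                    ("Slave0Adj", slave0_adj),
                    ("Slave1Adj", slave1_adj)):
        if not 0 <= v <= 0xFF:
            raise ValueError("%s must fit in 8 bits" % name)
    return (master_adj << 16) | (slave0_adj << 8) | slave1_adj

def unpack_dll_lane(value):
    """Return (MasterAdj, Slave0Adj, Slave1Adj) from a register value."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF

print(hex(pack_dll_lane()))
```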


8.4.8.18 R_DdrxDdpDLLLane1 - PHY Read Lane 1 DLL Configuration Register


Register


R_DdrxDdpDLLLane1


Address 


0x0_0000_0044 (plus base address) 
23:16 | MasterAdj Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).


15:8 | Slave0Adj Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).


8.4.8.19 R_DdrxDdpDLLLane2 - PHY Read Lane 2 DLL Configuration Register 


Register 


R_DdrxDdpDLLLane2 


Address 


0x0_0000_0048 (plus base address) 


23:16 | MasterAdj Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).


15:8 | Slave0Adj Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).


8.4.8.20 R_DdrxDdpDLLLane3 - PHY Read Lane 3 DLL Configuration Register


Register


R_DdrxDdpDLLLane3


Address


0x0_0000_004c (plus base address)


23:16 | MasterAdj RW 186 | Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).


15:8 | Slave0Adj RW 1 Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj RW 12 Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).


8.4.8.21 R_DdrxDdpDLLLane4 - PHY Read Lane 4 DLL Configuration Register 


Register 


R_DdrxDdpDLLLane4 


Address 


0x0_0000_0050 (plus base address) 


23:16 | MasterAdj 186 | Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).


15:8 | Slave0Adj RW Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj RW Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).


8.4.8.22 R_DdrxDdpDLLLane5 - PHY Read Lane 5 DLL Configuration Register 


Register 


R_DdrxDdpDLLLane5


Address 
0x0_0000_0054 (plus base address) 


Reserved


23:16 | MasterAdj Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).


15:8 | Slave0Adj Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).


8.4.8.23 R_DdrxDdpDLLLane6 - PHY Read Lane 6 DLL Configuration Register 


Register 


R_DdrxDdpDLLLane6 


Address 


0x0_0000_0058 (plus base address) 


23:16 | MasterAdj Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).


15:8 | Slave0Adj Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).


8.4.8.24 R_DdrxDdpDLLLane7 - PHY Read Lane 7 DLL Configuration Register


Register


R_DdrxDdpDLLLane7


Address


0x0_0000_005c (plus base address)


23:16 | MasterAdj RW 186 | Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).


15:8 | Slave0Adj RW 1 Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj RW 12 Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).


8.4.8.25 R_DdrxDdpDLLLane8 - PHY Read Lane 8 DLL Configuration Register 


Register 
R_DdrxDdpDLLLane8


Address 
0x0_0000_0060 (plus base address) 


23:16 | MasterAdj Master Delay Adjustment - specifies the number
of slave adjustment steps. (See DLL
description of DDP Unit for details on settings
based on clock frequency).

15:8 | Slave0Adj RW Slave DLL to delay dummy DQS to match the
DQS board trace delay to and from DIMM.
(See DLL description of DDP Unit for details
on settings).
7:0 | Slave1Adj RW 12 Slave DLL to delay DQS nominally by 1/4
DCLK. (See DLL description of DDP Unit for
details on settings based on clock frequency).


8.4.8.26 R_DdrxDdpDLLReset - PHY DLL Reset 


Register 
R_DdrxDdpDLLReset 


Address 
0x0_0000_0064 (plus base address) 


Definition


Reserved
Reset RW 1 Active high reset routed to each of the DLLs
in the PHY. Direct access is provided for the
DLL reset since the TrueCircuits documentation
says that DLL fault testing should be
done with the DLL reset asserted.


8.4.8.27 R_DdrxDdpCKReset - Reset for CK clock outputs to DIMM 


Register 
R_DdrxDdpCKReset
Address
0x0_0000_0164 (plus base address)


Reserved


Reset RW 1 Deasserting this CSR bit causes the PHY to 
start driving clocks to the DIMMs. Before 
deasserting this bit, software must make sure 
dclk and dmclk90 are stable and that ClkDriv- 
Imped of R_DdrxDdpCmdDrv is set to an ap- 
propriate value. 


8.4.8.28 R_DdrxDddRdDelay 
Register 
R_DdrxDddRdDelay 


Address 
0x0_0000_0068 (plus base address) 


Reserved


Reserved
RW Setting this to a 1'b1 adds an extra cclk cycle
of latency to the read return path as a debug
mechanism to prove bugs are not due to read
return fifo underflow.


8.4.8.29 R_DdrxDdiMemLoopBack 


Register 
R_DdrxDdiMemLoopBack 


Address 
0x0_0000_006c (plus base address) 


Definition


When this is set to "1", read and write requests
received by DDR will receive a fake completion
response (i.e. they will not really issue to memory
and will return meaningless data). This
is only expected to be used during the initial
boot sequence, where it is possible for reads
and writes to show up at the DDR unit that
don't need to complete correctly. This is because
the boot sequence involves the boot processor
doing writes to the cache, which will result
in the caches doing reads for allocation before
allowing the write (which the cache thinks is
necessary for coherence). This CSR bit needs to
be cleared before R_DdrxDdcDdpSoftReset is
de-asserted, and MemLoopBack must never be
asserted when R_DdrxDdcDdpSoftReset is de-asserted.
8.4.8.30 R_DdrxDdiRdPathRst
Register 
R_DdrxDdiRdPathRst 


Address 


(plus base address)
Bit | Mnemonic | Access | Reset | Definition


This is NOT intended for general use. It is a
hook for debugging potential issues with the
PHY DLL settings. When asserted, state elements
in the read return datapath are forced
to their reset values. A read cannot be outstanding
when this is asserted, and this must be
deasserted before any read is issued to the
DDR unit. When this CSR changes value, it
is NOT allowed to change value again for at
least 10 dclk cycles (Note: this requirement
should be met by default since it takes at
least 30 clock cycles to affect the same CSR
with back-to-back SCB writes to it).


8.4.8.31 R_DdrxDdiRdTimeOut 
Register 
R_DdrxDdiRdTimeOut 


Attributes 


-writeonemixed 


Address 
0x0_0000_0074 (plus base address) 


Reserved


Ly poe | | a the counters which are used to deter- 
mine if a read hangs. 


Causes the DDR unit to issue a false read completion
for reads that hang. See description of
Read Time-Out AutoCompletion in the DDI
subsection of this spec.
RdHang RW1C Set if a read has timed out. The value is sticky
until software writes a 1 to clear it.


8.4.8.32 R_DdrxDdpCalReset 


Register 
R_DdrxDdpCalReset 


Address 
0x0_0000_0078 (plus base address) 
31:1 | Reserved


0 | CalReset When asserted, the calibration logic in the DDR2-PHY
will be held in reset. After deasserting
the "dclk and cclk resets" which go to DDI,
R_DdrxDdpImpedCal_CalClk should be set, then
CalReset can be deasserted.


8.4.8.33 R_DdrxDdpCalError 
Register 
R_DdrxDdpCalError 


Attributes 


-writeonemixed 


Address 
0x0_0000_007c (plus base address) 


19 | CalUpdate RW1C Set when the calibration logic updates ImpP
and ImpN.


18 | CalErrDerate RW 1 If the auto-cal logic very rarely asserts
calfault_occur or cal_timout_occur, there may
not be a problem. CalErrDerate allows users
to cause the decrementing of CalErrCount every
time the auto-cal logic runs for 524,288
cycles without a cal_fault or a cal_timeout.


17 | CalErrInterrupt RW1C Asserted when CalErrCount has reached the
IntReportThreshold. This bit is sticky until
software does a write-one to clear it.


16:9 | CalErrCount R 8-bit saturating counter. Increments
whenever the auto-calibration logic in
the DDR-PHY asserts calfault_occur or
cal_timout_occur. NOTE that CalErrCount is
automatically cleared whenever CalErrInterrupt
is cleared.


IntReportThreshold RW 5 Asserts an interrupt if CalErrCount goes
above this specified value. (Valid values 1-255).


CalErrIntEnable RW 1 Setting this bit enables interrupts to be reported
for auto-calibration errors based on the
settings of the other fields of this CSR. Setting
this to zero forces both CalErrCount and
CalErrCount to be zero.


8.4.8.34 R_DdrxDdpCalEnable


Register 
R_DdrxDdpCalEnable 


Address 
0x0_0000_0080 (plus base address) 
31:1 | Reserved


0 | CalEnable If CalEnable is low when DdcDdpSoftReset is de-asserted,
the initial calibration settings updated
into the IOs will be the worst case SS corner
setting (relatively strong calibration settings)
(impP=12, impN=9). If CalEnable is never asserted,
these values will be permanently used. Once
CalEnable has been asserted, calibration values
will be updated into the IOs according to the settings
of the other Cal related CSRs. If CalEnable
is deasserted at some point, the values of
R_DdrxDdpImpedCal_LastUpdatedImpP/N. It is
recommended that users not toggle CalEnable, but
choose whether to leave it asserted or deasserted,
and use the finer grain controls of the DdrxDdpImpedCal
register to control update frequency
and temporary disabling.


8.4.8.35 R_DdrxDdpCalCounter 
Register 
R_DdrxDdpCalCounter 


Address 
0x0_0000_0084 (plus base address)


Bit | Mnemonic | Access | Reset | Definition
Reserved


Reserved
CalCounter 0 Determines the period between IO calibration
updates if AutoCalUpdate is enabled. CalCounter
is the upper 16 bits of a 32 bit count-down
counter, thus it decrements once every
65536 dclk cycles, so a value of 1 means do
an IO cal update once every 65536 dclk cycles.
Setting this to zero means to do a cal update
on the first opportunity after the calibrator
has come up with a new value. When the counter
reaches zero it means to update the IO calibration
on the next opportunity according to
CalMode and OverrideAutoCalibration.
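
Since CalCounter counts in units of 65536 dclk cycles, the update period follows directly from the dclk frequency. A small sketch, with hypothetical helper names:

```python
DCLKS_PER_COUNT = 65536  # CalCounter is the upper 16 bits of a 32 bit counter

def cal_update_period_dclks(cal_counter):
    """Period between IO calibration updates, in dclk cycles."""
    if not 0 <= cal_counter <= 0xFFFF:
        raise ValueError("CalCounter is a 16 bit value")
    return cal_counter * DCLKS_PER_COUNT

def cal_update_period_ms(cal_counter, dclk_mhz):
    """Same period in milliseconds for a given dclk frequency."""
    return cal_update_period_dclks(cal_counter) / (dclk_mhz * 1000.0)

print(cal_update_period_ms(1, 400), cal_update_period_ms(0xFFFF, 400))
```

At a 400MHz dclk one count is about 0.16ms, and the largest setting is a little under 11 seconds between updates.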


8.4.8.36 R_DdrxDdpImpedCal 


Register 
R_DdrxDdpImpedCal 


Address 
0x0_0000_0088 (plus base address) 


Bit | Mnemonic Access | Reset |} Definition 
| Bit | Mnemonic [Access | Reset | Definition 
Pay S™sé~<C~dSSS—~—C Sid Red SSCSC~SCSY 
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0 | ManualCalUpdateOtol 0->1 transition tells DDI to update the IOs 
with calibration values based on CalMode and 
OverrideAutoCalibration at the next opportu- 


nity. This bit should not be used when Auto- 
CalUpdate is set. They are mutually exclusive 


ways of controlling calibration value updates. 


29 | AutoCalUpdate | 1 - DDI will update the IOs with calibration values based on CalMode and
OverrideAutoCalibration at the next opportunity after CalCounter counts down to zero.
0 - Software must specifically initiate calibration value updates with ManualCalUpdate0to1.


28:27 | CalClk | 0 - CalClk = dclk/2
1 - CalClk = dclk/4
2 - CalClk = dclk/8
3 - CalClk = dclk/16
Notes:
1. R_DdrxDdpCalReset must be asserted when changing the value of CalClk.
2. CalClk is required to be less than 300 MHz.


26:25 | CalMode | See the description of IO calibration for more information on the CalModes:
0 - update IO calibration during DIMM auto refresh operation.
1 - update IO calibration during DIMM refresh operation, while zeroing the dram clk for one cycle.
2 - update IO calibration during precharge powerdown, while zeroing the dram clk for one cycle.
Note that CalMode 2 requires R_DdrxDdiMifCfg2_PwrDnEnable to be set to 1 (otherwise the logic may
hang waiting for a powerdown event which will never happen, and thus block forward progress for
memory requests). This mode also requires R_DdrxDdcMemCfg1_PchPowerDown to be set to 1.
3 - update IO calibration during dram self-refresh.
CalModes 2 and 3 may have a noticeable impact on performance if the CalCounter is set to zero or
a small value.


24 | OverrideAutoCalibration | RW | Override the auto calibration values computed; instead update
the I/O pads with the values OverrideImpP and OverrideImpN provided by this CSR.


23:20 | OverrideImpP | User supplied value for pull-up impedance calibration.

19:16 | OverrideImpN | User supplied value for pull-down impedance calibration.


The current calibration value loaded into the register which drives ImpP to the level shifter
in the IO ring.

The current calibration value loaded into the register which drives ImpN to the level shifter
in the IO ring.

Value determined by the auto calibration logic which currently needs to be fed into the IO pads
to adjust the pull-up impedance for outputs and input termination.

Value determined by the auto calibration logic which currently needs to be fed into the IO pads
to adjust the pull-down impedance for outputs and input termination.


LastUpdatedImpP 


LastUpdatedImpN 
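
The CalClk and CalMode constraints can be checked in software before the CSR is written. The following is only a sketch: the function names are hypothetical, and preferring the smallest divisor (the fastest legal CalClk) is an assumption rather than something this spec requires.

```python
# CalClk field encodings from the table: field value -> dclk divisor.
CALCLK_DIVISORS = {0: 2, 1: 4, 2: 8, 3: 16}
CALCLK_LIMIT_MHZ = 300.0  # CalClk is required to be less than this


def pick_calclk_field(dclk_mhz):
    """Return the smallest CalClk field value that keeps CalClk below 300 MHz.

    Choosing the smallest divisor is an illustrative policy, not a spec rule.
    """
    for field, divisor in sorted(CALCLK_DIVISORS.items()):
        if dclk_mhz / divisor < CALCLK_LIMIT_MHZ:
            return field
    raise ValueError("no CalClk setting keeps CalClk below 300 MHz")


def calmode2_prerequisites_met(pwr_dn_enable, pch_power_down):
    """CalMode 2 needs PwrDnEnable and PchPowerDown both set to 1."""
    return pwr_dn_enable == 1 and pch_power_down == 1
```

For a 400 MHz dclk this selects field 0 (CalClk = 200 MHz).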


Note that CalMode 2 is currently unsupported in general use. See bugzilla 2013, quoted here: 


When setting AutoCalUpdate in cal mode 2 (update during prechargePowerdown) the Ddi can hang. 
This is caused when a request is at the head of the queue requesting to be sent to the controller at the 
time we start the calibration update process. The calibration logic spins in place waiting for powerdown 
entry. However, this pending request causes the powerdown counter to be cleared on every cycle, which 
blocks the Ddr from ever entering powerdown mode. 


If CalMode 2 is used, provision must be made to either ensure that no memory references are outstanding at the
time that a calibration cycle is initiated, or that some processor is capable of unjamming the autocal sequencer. If
you don’t understand this, then note that CalMode 2 is currently unsupported.


8.4.8.37 R_DdrxDdpDataDrv
Register
R_DdrxDdpDataDrv


Address
0x0_0000_008c (plus base address)


31:27 | Reserved
26:24 | DqBl8DrivImped | RW | Byte Lane 8 Output Driver Strength
111 - UNDEFINED
110 - UNDEFINED
101 - Tx Mode 60 Ohm (4.7mA)
100 - Tx Mode 40 Ohm (7.0mA)
011 - Tx Mode 24 Ohm (11.7mA)
010 - Tx Mode 20 Ohm (14.0mA)
001 - UNDEFINED
000 - Tx Mode 17 Ohm (16.5mA)
23:21 | DqBl7DrivImped | RW | Byte Lane 7 Output Driver Strength
20:18 | DqBl6DrivImped | RW | Byte Lane 6 Output Driver Strength
17:15 | DqBl5DrivImped | RW | Byte Lane 5 Output Driver Strength
14:12 | DqBl4DrivImped | RW | Byte Lane 4 Output Driver Strength
11:9 | DqBl3DrivImped | RW | Byte Lane 3 Output Driver Strength
8:6 | DqBl2DrivImped | RW | Byte Lane 2 Output Driver Strength
5:3 | DqBl1DrivImped | RW | Byte Lane 1 Output Driver Strength
2:0 | DqBl0DrivImped | RW | Byte Lane 0 Output Driver Strength
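
As an illustration of how the drive-strength encodings listed above combine into one register value, here is a minimal sketch. It assumes each byte lane N occupies bits 3N+2:3N, as the 26:24 (lane 8) and 2:0 (lane 0) entries suggest; the function name is hypothetical.

```python
# Encodings from the table: output impedance in ohms -> 3-bit field value.
# 001, 110 and 111 are UNDEFINED and are deliberately absent.
DRIVE_CODES = {17: 0b000, 20: 0b010, 24: 0b011, 40: 0b100, 60: 0b101}


def pack_data_drv(ohms_per_lane):
    """Pack nine per-lane impedances (lane 0 first) into a register value."""
    if len(ohms_per_lane) != 9:
        raise ValueError("expected one impedance per byte lane (9 lanes)")
    value = 0
    for lane, ohms in enumerate(ohms_per_lane):
        if ohms not in DRIVE_CODES:
            raise ValueError(f"{ohms} ohm has no defined encoding")
        # Assumed layout: lane N lives in bits 3N+2:3N.
        value |= DRIVE_CODES[ohms] << (3 * lane)
    return value
```

Rejecting impedances without a defined encoding keeps the UNDEFINED codes from ever being written.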




8.4.8.38 R_DdrxDdpDQSDrv 
Register 


R_DdrxDdpDQSDrv 


Address 


0x0_0000_0090 (plus base address) 


26:24 | Dqs8DrivImped | RW | DQS8 Output Driver Strength
111 - UNDEFINED
110 - UNDEFINED
101 - Tx Mode 60 Ohm (4.7mA)
100 - Tx Mode 40 Ohm (7.0mA)
011 - Tx Mode 24 Ohm (11.7mA)
010 - Tx Mode 20 Ohm (14.0mA)
001 - UNDEFINED
000 - Tx Mode 17 Ohm (16.5mA)

Dqs1DrivImped | RW | 5 | DQS1 Output Driver Strength
8:6 | Dqs2DrivImped | RW | 5 | DQS2 Output Driver Strength


8.4.8.39 R_DdrxDdpCmdDrv 
Register 


R_DdrxDdpCmdDrv 


Address 


0x0_0000_0094 (plus base address) 


Output Driver Strength for address/command (A[15:0], BA[2:0], RAS, CAS, WE)
111 - UNDEFINED
110 - UNDEFINED
101 - Tx Mode 60 Ohm (4.7mA)
100 - Tx Mode 40 Ohm (7.0mA)
011 - Tx Mode 24 Ohm (11.7mA)
010 - Tx Mode 20 Ohm (14.0mA)
001 - UNDEFINED
000 - Tx Mode 17 Ohm (16.5mA)

CntrDrivImped | 5 | Output Driver Strength for ODT, CKE, CS
2:0 | ClkDrivImped | 5 | Output Driver Strength for CK


8.4.8.40 R_DdrxDdiPHYWrptrCopy - This read-only CSR is intended to be used for debugging
only. The values only become valid after the last outstanding read has completed. The
pointer is Gray coded. When all outstanding reads have completed, the value of
R_DdrxDdiPHYWrptrCopy is expected to be 0001, 0111, 1101, or 1011.


Register
R_DdrxDdiPHYWrptrCopy


Address
0x0_0000_0098 (plus base address)


Reserved

Copy of the PHY’s fifo write pointer. Value only valid when NO reads are outstanding.
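
When debugging with this CSR it can help to convert the Gray-coded pointer to a plain binary index. The sketch below assumes a 4-bit pointer in the standard reflected-binary Gray code; the spec says only that the pointer is Gray coded, and the function names are hypothetical.

```python
def gray_to_binary(gray):
    """Convert a reflected-binary Gray code value to its binary index."""
    binary = 0
    while gray:
        binary ^= gray
        gray >>= 1
    return binary


# Values the spec expects once all outstanding reads have completed.
EXPECTED_IDLE = {0b0001, 0b0111, 0b1101, 0b1011}


def is_idle_pointer(raw):
    """True if the low 4 bits match one of the expected idle values."""
    return (raw & 0xF) in EXPECTED_IDLE
```

Under that assumption the four expected idle values decode to indices 1, 5, 9 and 13, that is, positions spaced four apart.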


8.4.8.41 R_DdrxDdpHoldFix - This register has been included as a preventive measure. If it turns
out that there are hold time problems with the sending of cmd/addr signals to the DIMM,
setting bits in this register muxes in delay elements to add additional hold time margin.


Register 
R_DdrxDdpHoldFix 


Address 


0x0_0000_009c (plus base address) 


3 | DelayCs | RW | 0 | Adds delay to chip selects
2 | DelayOdt | RW | 0 | Adds delay to odt signals
1 | DelayCke | RW | 0 | Adds delay to CKE
0 | DelayAddr | RW | 0 | Adds delay to address signals


8.4.8.42 R_DdrxDdpHighSpeedTest - This CSR is only intended for use during chip testing, where 
a tester is acting as a DIMM. 


Register 
R_DdrxDdpHighSpeedTest 


Address
0x0_0000_00a0 (plus base address)


Causes DDI to:

1. Hold the wr_req line high so that it is constantly issuing write requests to the NWL memory
controller (DDC). Each write request uses a randomly generated address. Note that the address
spans the full 16GB logical address space.

2. Whenever the NWL logic controller gives a write data grant, DDI will send a data pattern
to the NWL logic controller such that the even DQ bits will toggle for the first four DQS clock
edges, and then the odd DQ bits will toggle for the last four DQS edges of the transfer to the
DIMM. The DM bits will toggle every other DQS clock edge.

Setting this HIGH forces a HIGH onto the dll_bypass_slave inputs to the DDR-PHY byte lanes.
This is needed during high speed read testing of the DDR PHY so that the tester can drive a
pre-shifted DQS (relative to the DQ) and directly write data into the DDR-PHY’s read fifo.

NOTE: This toggles logic which crosses between two clock domains, thus all logic should be
quieted for a few cycles before and after this signal is written. To meet this requirement,
tests that change the value of the CSR are required to first issue a read to this CSR, followed
by the write to this CSR, and then followed by another read to this CSR. No other action is
allowed until the second read of the written data comes back.
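
The read, write, read sequence required by the note can be wrapped in one helper so that tests cannot skip a step. This is a sketch only: `FakeCsrBus` is a hypothetical stand-in for whatever CSR access path a test environment provides, and the offset is the one from the Address field, with the base address omitted.

```python
class FakeCsrBus:
    """Hypothetical stand-in for a CSR access path; logs accesses in order."""

    def __init__(self):
        self.regs = {}
        self.log = []

    def read(self, addr):
        self.log.append(("read", addr))
        return self.regs.get(addr, 0)

    def write(self, addr, value):
        self.log.append(("write", addr, value))
        self.regs[addr] = value


HIGH_SPEED_TEST_OFFSET = 0xA0  # register offset; base address not included


def write_high_speed_test(bus, value, addr=HIGH_SPEED_TEST_OFFSET):
    """Change the CSR using the required read / write / read sequence."""
    bus.read(addr)             # read before the write, as the note requires
    bus.write(addr, value)
    readback = bus.read(addr)  # nothing else may be issued until this returns
    if readback != value:
        raise RuntimeError("CSR readback does not match the written value")
    return readback
```

Keeping the sequence in a single function means the caller cannot issue other traffic between the write and the confirming read.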


8.4.8.43 R_DdrxDdiECCCaptureEnable 
Register 
R_DdrxDdiECCCaptureEnable 


Address 
0x0_0000_00a4 (plus base address) 


EnableRdECCCapture | RW | When asserted, the CSRs R_DdrxDdiRdECCCapture0-1 will store the
value of the ECC field of the last read sent out on the CSW bus. This should only be enabled
during DDR DLL calibration, and not during normal operation where more than one read can be
outstanding at a time.

ClearRdECC | When asserted, causes R_DdrxDdiRdECCCapture0-1 to clear.


8.4.8.44 R_DdrxDdiRdECCCapture0
Register
R_DdrxDdiRdECCCapture0


Address
0x0_0000_00a8 (plus base address)


31:24 | Data3ECC | Stores the ECC value of the last read data driven out on
ddr_coh_Data3_c2a[71:64] (Cleared by R_DdrxDdiECCCaptureEnable_ClearRdECC)
23:16 | Data2ECC | Stores the ECC value of the last read data driven out on
ddr_coh_Data2_c2a[71:64] (Cleared by R_DdrxDdiECCCaptureEnable_ClearRdECC)
15:8 | Data1ECC | Stores the ECC value of the last read data driven out on
ddr_coh_Data1_c2a[71:64] (Cleared by R_DdrxDdiECCCaptureEnable_ClearRdECC)
7:0 | Data0ECC | Stores the ECC value of the last read data driven out on
ddr_coh_Data0_c2a[71:64] (Cleared by R_DdrxDdiECCCaptureEnable_ClearRdECC)


8.4.8.45 R_DdrxDdiRdECCCapture1
Register
R_DdrxDdiRdECCCapture1


Address
0x0_0000_00ac (plus base address)


31:24 | Data7ECC | Stores the ECC value of the last read data driven out on
ddr_coh_Data7_c2a[71:64] (Cleared by R_DdrxDdiECCCaptureEnable_ClearRdECC)
23:16 | Data6ECC | Stores the ECC value of the last read data driven out on
ddr_coh_Data6_c2a[71:64] (Cleared by R_DdrxDdiECCCaptureEnable_ClearRdECC)
15:8 | Data5ECC | Stores the ECC value of the last read data driven out on
ddr_coh_Data5_c2a[71:64] (Cleared by R_DdrxDdiECCCaptureEnable_ClearRdECC)
7:0 | Data4ECC | Stores the ECC value of the last read data driven out on
ddr_coh_Data4_c2a[71:64] (Cleared by R_DdrxDdiECCCaptureEnable_ClearRdECC)


This section instantiates two copies of the configuration registers for the two instances of DDR (DDR0 and
DDR1).


8.4.9.1 Ddr0
Register
R_Ddr0* : R_Ddrx*


Address
0xE_4800_0000-0xE_48FF_FFFF


8.4.9.2 Ddr1
Register
R_Ddr1* : R_Ddrx*


Address
0xE_5800_0000-0xE_58FF_FFFF


8.4.10 Vregs_End_Of_Decl 
8.4.11 DDR Performance Events


The following events are trackable by DDR statistical event counting:


Enum 


DdrxEvent 


Attribute 


-descfunc 


8’h01 | CAS | Number of Read and Write commands issued to DDR2-SDRAM. For analysis studies
on the use of auto-precharge, tests can be run with R_DdrxDdiMifCfg1_AutoPch = 0. The
difference (CAS - RAS) gives the total number of DRAM accesses that hit on an open page within
a bank. ((CAS - RAS) / CAS) gives the ratio of total page hits over total DRAM accesses.

8’h02 | RAS | Number of Bank Activate commands issued to DDR2-SDRAM.

8’h03 | MEMRD | Number of reads issued to the DIMM.

8’h04 | MEMWR | Number of writes issued to the DIMM.

8’h05 | MULTRDBIDS | Cycles with more than one read request bidding for DDC.

8’h06 | MULTWRBIDS | Cycles with more than one write request bidding for DDC.

8’h07 | RDANDWRBIDS | Cycles with at least one read and one write request bidding for DDC.

8’h08 | POWERDOWN | Number of cycles in powerdown.

8’h09 | NEXM | Number of attempted accesses to non-existent memory. (These are software errors
which could cause data corruption.)

Reserved
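
The page-hit arithmetic described for the CAS and RAS events can be written out as a small helper. This is a sketch with a hypothetical function name; it assumes both counts were collected over the same interval with auto-precharge disabled, as the CAS description requires.

```python
def page_hit_stats(cas, ras):
    """Return (page_hits, hit_ratio) from CAS and RAS event counts.

    page_hits = CAS - RAS, hit_ratio = (CAS - RAS) / CAS.
    """
    if cas <= 0:
        raise ValueError("CAS count must be positive to form a ratio")
    hits = cas - ras
    return hits, hits / cas
```

For example, 1000 CAS commands with 250 bank activates means 750 accesses hit an open page, a ratio of 0.75.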


8.5 DDC Section - DDR2 SDRAM Controller IP Block 


The DDC section contains a version of NorthWest Logic’s DDR2 memory controller customized for low latency.
The read return data path has been removed. In our system, the core will pull read data directly out of the
DDR2-PHY. The delay in the address/CMD path has been reduced.

Specifications can be found in the project tree at: 

.../hw/ip/northwestlogic/release 4##/documentation/ 

DDR2_SDRAM_Controller_Core_Datasheet#-#.pdf 

SiCortex_DDR2_Custom_Interface_Addendum##.pdf 

# - denotes version numbers, which may be different between the files and parent directory. 


8.6 DDD Section - Datapath interface to PHY 


DDD interfaces to the DDR2-PHY for extracting read data out of the PHY’s read data fifo. DDD also replicates 
control signals from DDC into copies which are pitch matched to the individual PHY datapath slices. 

Whenever DDP writes the first subcell of an entry of its read data fifo, it toggles a signal which is sent to DDD.
DDD synchronizes this signal, begins pulling data out of the PHY’s read data fifo, and drives it out onto the
CSW bus. (Setting R_DdrxDddRdDelay_DelayFifoReadOut = 1 will add an extra cclk cycle of latency before the data
is read out of the fifo; this is not needed, but is provided as a debug hook.) DDD runs on CCLK, but can keep up
with the rate at which data is written into the read data fifo: it pipelines reads out of the fifo onto the 72 B
CSW bus, while the fifo input arrives at the DCLK rate but is only 16B wide (8B each on the rising and falling
edges of DCLK). When DDD causes the assertion of ddr_coh_DataValid_c2a it grabs the CSW bus (i.e., DDR does
not need to arbitrate for CSW access).


8.7 DDP Unit - DDR2 SDRAM PHY IP Block 


8.7.1 Overview 


The DDP unit contains the DDR2-SDRAM PHY, which is a hard macro designed by Esilicon. Some
block diagrams and timing diagrams are located in the project tree under .../hw/ip/esilicon/doc/ddr2_phy_diagram_v#.pdf
where # is the latest released version number.


8.7.2 Clocks 


DDP receives two clocks: DCLK, and DM90CLK, which is shifted minus 90 degrees relative to DCLK (i.e.
DM90CLK is 1/4 cycle earlier). Both of these clocks originate from one of the main PLL instances. Note that
the PLL provides a pll_clock output and a pll_clock90 output which is shifted by positive 90 degrees, so DCLK
will be driven by pll_clock90 for all of DDR/DDP and DM90CLK will be driven by pll_clock.

The clock which DDP drives to the DIMM is based on DM90CLK. The phase shift between the two clocks is
used by the PHY in the write path logic to meet the DDR2 spec requirement of DQS being shifted relative to DQ
during writes.


8.7.3 Address and Command Path


DDP flops all address and command path inputs synchronously on the DCLK. These signals are then driven 
out the output pad to the DIMM. The command and address signals are: 

A[15:0] - Address 

BA[2:0] - Bank address 

RAS_L - RAS command line 

CAS_L - CAS command line 

WE_L - WE command line (write enable) 

CS_L[3:0] - CS command line (chip select (really rank select in our case)) 

ODT[3:0] - On-Die Termination 

CKE[1:0] - Clock enable


8.7.4 Write Path 


DDP flops all of its write path inputs synchronously on the DCLK. Some of the write path signals are then
flopped with DM90CLK. Please see ddr2_phy_diagram_v#-.pdf for logic diagrams. During writes DQS is driven out
90 degrees later than DQ.


8.7.5 Read Path 


DDP’s read return path is customized to reduce read return latency. Read data returning from the DDR2
DIMMs (DQ[71:0]) have an associated strobe clock DQS[8:0]. There are a number of issues which need to be handled
before the DQS can be used to capture the associated data. Firstly, because DQS is a bidirectional bus (driven by
us during writes and driven by the DIMM for reads) it needs to be filtered so that it doesn’t cause false data
capture due to it toggling during writes or toggling due to noise when it is undriven. Secondly, DQS needs to be
shifted so that it lines up with the data eye so that data can be correctly captured. The read datapath is repeated
9 times, corresponding to the 9 bytes of read data received in parallel.

In order to filter DQS, the DDR2-PHY needs to identify the preamble and postamble of the read data transfer.
The start of the read preamble is defined as 1 clock prior to the first rising edge of DQS during a read burst, with
no external delays (DQS aligned to CLK_M90). The read postamble ends 1/2 clock after the last falling edge of
DQS during a read burst, assuming no external delays (DQS aligned to CLK_M90). The NWL controller (DDC)
sends signals to DDP to identify the timing of the preamble (see the logic diagram and associated timing waveforms
for CTLDQSA_PREAMB_ENABN and CTLDQSB_PREAMB_ENABN in ddr2_phy_diagram_v#-.pdf (the location of
this file is provided above in the overview subsection); also see the description of phy_dqs_preamble_en_n_a and
phy_dqs_preamble_en_b in NWL’s DDR2 SDRAM Controller Core SiCortex Custom Interface Addendum). These
signals are combined to create DDO_DQS_PREAMB_ENABN, which is then sent through a dummy instance of the
differential I/O cell used for DQS to match delay variation due to PVT changes seen by DQS. The preamble enable
then goes through a slave DLL which compensates for the board trace length round trip delay between the ASIC and
DIMM (the delay setting for this DLL is controlled per byte lane by the CSRs R_DdrxDdpDLLLane#_Slave0Adj).
The output of the DLL enables the PHY to receive the DQS strobe and starts a 4 cycle counter which keeps the
PHY enabled to receive DQS (the counter works because all reads are full 72 B (4 cycle) reads).

Ideally, after DQS is filtered, its timing will match that of the DQ input after it has gone through similarly
matched logic. It is then necessary to delay the DQS by approximately a quarter cycle so that it can be used as a
capture clock for DQ. This delay is obtained from a second slave DLL (the delay setting for this DLL is controlled
per byte lane by the CSRs R_DdrxDdpDLLLane#_Slave1Adj).

The captured read data is placed into a fifo which lives in the DDR2-PHY. The fifo is 4 entries deep, where each
entry is 72 B wide. Each fifo entry has 8 sub-cells corresponding to each of the 8 data sub-transfers associated with
a full 72-B read. Whenever DDP writes the first sub-cell of a fifo entry it sends a signal to DdrDdd to signify
that it is safe to start pulling data out of the next fifo entry. After proper synchronization, DdrDdd starts pulling
data out of the PHY.


8.7.6 DLLs 


Each of the 9 byte lanes of a PHY instance includes an embedded analog DLL module from True Circuits
based on their Part: TCI-TN90G-DDRLDLL. Each module contains one master DLL and two slave DLLs. Detailed
information is located in the project tree at .../hw/ip/esilicon/release_11_19_05/dll_090g, in particular the document
TCITSMCOO9DDRDDLLA1_guide.txt is very informative.

The DLL parameters are:

Reference input frequency range: 93MHz - 465MHz 

Slave delay adjustment range: 0% - 100% of reference clock 

Number of slave adjustment steps (MADJ) - 160 (See below, DLL Master Adjustment section as Sam Stewart 
at Esilicon provided different info) 

Slave delay equation - Tf + [(ADJ + ADJ_offset)/MADJ] * Tref 

Fixed delay offset (Tf) (nom) - 90ps (this delay is cancelled by the match cell used for the DQS shift path.) 

Fixed code offset (ADJ_offset) - 34 steps
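
To make the slave delay equation concrete, the sketch below evaluates Tf + [(ADJ + ADJ_offset)/MADJ] * Tref in picoseconds. The function name and the choice of picoseconds are assumptions for illustration; the defaults are the nominal Tf of 90 ps and the 34-step code offset listed above.

```python
def slave_delay_ps(adj, madj, tref_ps, adj_offset=34, tf_ps=90.0):
    """Slave DLL delay: Tf + ((ADJ + ADJ_offset) / MADJ) * Tref, in ps."""
    if madj <= 0:
        raise ValueError("MADJ must be positive")
    return tf_ps + (adj + adj_offset) / madj * tref_ps
```

With ADJ = 12, MADJ = 184 and Tref = 2500 ps (400 MHz), the adjustable term is (46/184) * 2500 = 625 ps, a quarter of the clock period, and the total including the nominal Tf is 715 ps.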


8.7.6.1 DLL Master Adjustment 


According to information provided by Sam Stewart at Esilicon, the MADJ setting is frequency dependent. 
Verilog simulations of the PHY seem to corroborate this. 

MADJ_MAX = (160 * 465 MHz) * Tref

This implies the following settings should be used for R_DdrxDdpDLLLane#_MasterAdj:

DCLK = 400 MHz (2.5ns) => MADJ = 184 

DCLK = 333 MHz (3ns) => MADJ = 224 

DCLK = 267 MHz (3.75ns) => MADJ = 252 (The formula says 279, but the MADJ is 8 bits wide (caps at 
255)) 
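
The MADJ_MAX formula can be evaluated directly to see where the recommended settings come from. In this sketch the function names are hypothetical. The raw formula gives about 186, 223 and 279 for the three clock periods, so the listed settings of 184 and 224 are close to, but not exactly, the formula values; treat the listed settings as the reference.

```python
MADJ_FIELD_MAX = 255  # the MADJ field is 8 bits wide


def madj_from_formula(tref_ns, steps=160, fmax_mhz=465.0):
    """Evaluate MADJ_MAX = (160 * 465 MHz) * Tref with Tref in ns."""
    return steps * fmax_mhz * tref_ns / 1000.0


def exceeds_field(tref_ns):
    """True if the formula value does not fit in the 8-bit MADJ field."""
    return madj_from_formula(tref_ns) > MADJ_FIELD_MAX
```

Only the 3.75 ns case overflows the 8-bit field, which is why that setting is held below 255.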


8.7.6.2 DLL range calculations for Slave0 (DQS preamble enable DLL to match board trace length 
to memory) 


DLL slave 0 adjustment range: 1-134. 


8.7.6.3 DLL range calculations for Slave1 (DQS 1/4 cycle delay DLL)


DLL slave 1 adjustment recommended settings:
DCLK = 400 MHz (2.5ns) => ADJ1 = 12
DCLK = 333 MHz (3ns) => ADJ1 = 22
DCLK = 267 MHz (3.75ns) => ADJ1 = 29

Slave 1 setting = (MADJ / 4) - ADJ_offset, where ADJ_offset is the ADJ fixed code offset of 34 steps. Tf
has been compensated for in the design.
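
The Slave 1 rule can be checked against the recommended MADJ settings from the DLL Master Adjustment section (184, 224 and 252). A minimal sketch, with a hypothetical function name:

```python
def slave1_setting(madj, adj_offset=34):
    """Slave 1 setting = (MADJ / 4) - ADJ_offset."""
    return madj // 4 - adj_offset
```

The three MADJ settings are multiples of 4, so integer division is exact here, and the results reproduce the ADJ1 values of 12, 22 and 29 listed above.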


8.7.7 I/O pads 


The I/O cells used for DDR2 are from ARM’s 90nm 1.2Gbps DDR1/DDR2 Combo Library for TSMC G. These
have 1.8V drive, 1.0V core interface for DDR2.


8.7.7.1 Impedance Calibration


The I/O cells include pull-up and pull-down impedance for driver strength setting and On-Die-Termination.


Chapter 9
Counters, Performance Counters, & OCLA Overview


[Last modified: $Id: counters.lyx 31059 2007-01-30 21:16:09Z pholmes $] 


9.1 What’s Available 


The Ice9 chip provides various ways to gain information on internal events and status. The SCB Bus provides 
access to internal status and counters to MSP (and SSP) from outside an Ice9, as well as to the 6 processors within 
Ice9. Processor code can read CPU Counters. And internal signals can be driven to an Ice9 external pin. 

This status information is provided by SCB Registers, the SCB Performance Counters mechanism, and OCLA 
(On Chip Logic Analyzer). Performance Counters and OCLA can be used in various ways. 

Simpler methods of gaining visibility take less configuration effort than the more complicated methods. In order 
by increasing complexity, these methods of gaining visibility into Ice9 are: 


• SCB register “good” and “bad” status bits within various sub-blocks of Ice9, many of which can cause
interrupts.

• SCB register counters within various sub-blocks of Ice9.

• CPU Performance Counters, 2 in each MIPS core.

• SCB Performance Counters used to get up to 2 configurable 32-bit counters.

• SCB Performance Counters used to get up to 256 statistical-percentage counters.

• OCLA driving internal signals out an Ice9 external pin.

• OCLA used to get a highly-configurable 12-bit counter.

• OCLA used to record a timeline of the times when an event occurred.

• OCLA used to capture trace and value information like a logic analyzer.


And of course you can use more than one of these at the same time. You could have the SCB register counters 
counting, at the same time that SCB Performance Counters is doing something, at the same time that OCLA is 
doing something. Let’s look at each of these in more detail... 


9.2 Status Bits 


Various Ice9 sub-blocks have “good” and “bad” status bits that can be read from SCB registers. 

For an overview of error conditions, and info on ECC errors, see the “Reliability, Availability, Serviceability,
and Error Handling” chapter of the system hardware spec. This chapter can also be found under rev-control in
<project>/specs/system/Reliability/Reliability.lyx.

The PCI-Express unit has a “Link Up” bit, and the 6 Fabric Link units have “MissionMode” bits.

Most sub-blocks have “bad” status bits of various kinds. Most of these can be enabled to drive interrupts to the
processors. Even when a particular interrupt is not enabled, the status bit for that condition is usually readable
over the SCB Bus.


9.3 Counters


Some Ice9 sub-blocks have counters locally-implemented (within the sub-block) that can be read from SCB, 
counting normal and error type events. Some sub-blocks rely entirely on the SCB Performance Counters for any 
counting you may wish to do, and some have both their own counters as well as SCB Performance Counters hookup. 

Locally-implemented counters are simpler to get information from than OCLA or SCB Performance Counters,
requiring no configuration ahead of time, except that in some cases they should be cleared at the appropriate step
in the boot process. Furthermore, they’re always “on”, giving a true count of their particular event.

In the DMA sub-unit, a philosophy was taken that if counting was needed, DMA microcode could do the 
counting and store the values in memory. 

Fabric Switch counters are 32-bit, but counters in the Fabric Link are much smaller. 

These counters may not have been verified as well as the main functionality of the chip, depending on the 
sub-block. For some counters the count may not be exactly what would be literally correct during complex error 
conditions. But in general, during error-free conditions the error counters will remain zero and the good event 
counters will count correctly. And in general, during simple error conditions the error counters will count their 
respective errors correctly. 

As of September 2006, Link-unit counters have been verified for small counts, but not for large counts or
rollovers. Nuances about their counts are documented in the Link Spec <project>/specs/ice9/link/link.lyx.

As of September 2006, Fabric Switch counters have been tested as correct during good traffic and simple errors,
although during complex errors or periods of time not processing traffic the counts may be off. Fabric Switch
counters are documented in <project>/specs/ice9/fabric/fabric.lyx.


9.4 CPU Performance Counters 


Each of our 6 embedded MIPS cores has 2 configurable Performance Counters within it. 

See the MIPS Spec <project>/hw/cpu/opal_2_3/docs/MD00012-2B-5K-SUM-02.08.pdf section 6.22
“Performance Counter Register”. Read this for the mechanism of how to use these counters, but read the “Processor
Segments” chapter of our Chip Spec for the list of events.

In the “Processor Segments” chapter <project>/specs/ice9/processor/processor.lyx see section “CPU 
Performance Counter Events”. Note differences between ICE9A and ICE9B. 


9.5 SCB Performance Counters 


836 different events or conditions are wired to the SCB Performance Counters mechanism, coming from many
sub-blocks in Ice9, with strong emphasis on the processors themselves. There’s a good list of within-processor
events to count, separately selectable for CPU0 through CPU5. In addition to these events wired directly to the
SCB system, much of the OCLA triggering system is also available as events for SCB Performance Counters.

SCB Performance Counters require configuration in order to be used, but they are much simpler to use than OCLA.

SCB Performance Counters are 32 bits wide, many more bits than the counters in OCLA.

Not only can you choose from that long list of events, but you can condition any event by another event, 
counting only “if AND” the other event, or “if AND NOT” the other event. 

You can choose between “how many clocks was it high for” and “how many times did it go high for awhile”, or 
even “how many times was it high for more than N clocks”. Collecting that last version for more than one value of 
N, you can gather histogram information. 
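
As a sketch of how the “high for more than N clocks” counts turn into a histogram: subtracting the count for a larger N from the count for a smaller N gives the number of pulses whose length fell between the two thresholds. The function name and data layout are hypothetical, and the sketch assumes the workload is repeatable enough that counts taken in separate runs (one value of N per run) are comparable.

```python
def histogram_from_longer_than(counts_by_n):
    """Build pulse-length bins from 'high for more than N clocks' counts.

    counts_by_n maps N -> number of pulses that stayed high for more than
    N clocks. Returns a list of ((lo, hi), count) where count is the number
    of pulses longer than lo clocks but not longer than hi clocks; the last
    bin has hi = None (open ended).
    """
    thresholds = sorted(counts_by_n)
    bins = []
    for lo, hi in zip(thresholds, thresholds[1:]):
        bins.append(((lo, hi), counts_by_n[lo] - counts_by_n[hi]))
    last = thresholds[-1]
    bins.append(((last, None), counts_by_n[last]))
    return bins
```

For example, counts of 100, 30 and 5 for N = 0, 4 and 16 give 70 pulses of 1 to 4 clocks, 25 pulses of 5 to 16 clocks, and 5 pulses longer than 16 clocks.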

Tests causing each event (that’s wired to the SCB Performance Counters) have NOT been written as of October 
2006, so some events may not work correctly. More to the point, although I expect most events to work correctly 
and count what you’d think they count, in a few cases the name of the event may not mean what you first think 
it does. When in doubt, ask the sub-unit designer what’s being counted (or asserted) by that event signal. 

There are cross-connections in both directions between SCB Performance Counters and OCLA, but those
connections are not required for use of either. To keep SCB Performance Counters configuration simple, first see
if the events you need are directly available in the AllEvents list. If not, then look at what events OCLA could
provide. Accessing OCLA events for counting is much simpler than the full use of OCLA; no OCLA LAC program
is needed.

See the “Serial Configuration Bus” chapter of the chip or hardware-system spec. This chapter can also be found
under rev-control in <project>/specs/ice9/chipSCB/chipSCB.lyx. There’s a lot in that chapter, so look for
the “Performance Counting” sub-section, and then the later “Performance Counting Registers” sub-section.

In our Chip Spec there is no one list of all the events which can be counted by SCB Performance Counters. The
best place to look for a nearly-complete list is in the extracted software defines file. As of January 2007 the
software defines for these are in <project>/sw/include/sicortex/ice9/ice9_all_spec_sw.h as enum Ice9_EnumAllEvent.

The majority of SCB Performance Counters events are from inside the 6 processors. The list of “from the
processors” SCB Performance Counters events is found in the “Processor Segments” chapter
<project>/specs/ice9/processor/proc
sections “SCB Performance Events” and “SCB Performance Core Events”. Note Ice9A vs Ice9B differences. This
list is duplicated 6 times, once for each MIPS processor.

OCLA events are not listed in the extraction. Although the hardware exists to count OCLA trigger-block events 
in SCB Performance Counters, it is not actively-supported or documented at this time. 

The SCB Performance Counters mechanism can be used in 2 quite-different ways, for “ordinary counters”, or 
for “statistical percentage counters”, as described below. 


9.5.1 Ordinary Counting with SCB Performance Counters 


If you want “a count of how many times something happened” for one or two of the many events wired to the 
SCB Performance Counters, you can configure this mechanism to dwell on those events continuously, giving you a 
“full count” of how many times those events occurred. 

There’s a limit of 2 events at a time. 

If you want an event conditioned by another, then those 2 events have already used-up your limit of watching 
only 2 events at a time. 

You may wish to stay off the SCB Bus during the time-period you’re interested in. Any SCB
writes or reads create short “black-out times” when your events may occur but not be counted.

Prior to the time-period of interest, use SCB-writes to configure SCB Performance Counters for the events you 
want, and for a large dwell-time, and no incrementing of the bucket-number. Then, after the time-period of interest, 
use SCB-reads to get your counts. 

Events coming from clock-domains other than cclk (like FSW) will be counted correctly. 


9.5.2 Statistical Counting with SCB Performance Counters 


Your choice of up to 256 of the available events (AllEvent and OCLA events) can be scanned with a given 
configuration of SCB Performance Counters. 

In this style-of-use the goal is to get an estimate of “activity density” or “statistical percentage-of-time” of the 
events. For each selected event-signal you will be able to get a rough estimate of what percentage of time that 
signal was true. 

With this information you could compare different events to see which was occurring more often or more of the
time. When tuning or diagnosing performance, you can see percent utilization of an Ice9 sub-block, or an interface
from one sub-block to another.

The SCB Performance Counters mechanism scans across the configured events, dwelling the same amount of 
time on each. After a period of scanning, you read the counts for each event. You can compare them, or divide 
these counts by the number of cclks spent watching for each, to get a percentage-of-time asserted. 
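
The percentage-of-time calculation described above is a simple division. The sketch below uses hypothetical function names and assumes you already have, for each event, the count read back and the number of cclks the mechanism spent watching that event.

```python
def percent_asserted(count, cclks_watched):
    """Estimated percentage of time an event signal was true."""
    if cclks_watched <= 0:
        raise ValueError("cclks_watched must be positive")
    return 100.0 * count / cclks_watched


def rank_events(samples):
    """samples maps event name -> (count, cclks_watched).

    Returns (name, percent) pairs, busiest event first.
    """
    ranked = [(name, percent_asserted(c, w)) for name, (c, w) in samples.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Ranking the events this way gives a quick view of which signal was asserted for the largest share of its watched time.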

This style of use does not get you a “full count” of events, because the mechanism was scanning across events. 
For any one event, most of the time that event wasn’t being watched. 

This style of use is protected against black-out times when SCB writes or reads are taking place. The dwell
time of watching for an event doesn’t count time periods when SCB writes or reads are happening.

To get good statistics, the activity of interest should be more-or-less in a “steady state”, and then SCB
Performance Counters should be configured to dwell long enough on each event to get a representative sample, as
described in the “Serial Configuration Bus” chapter.


9.6 OCLA


OCLA (On Chip Logic Analyzer) was designed to capture values of many signals in response to a simple or 
complex trigger event, but it can also be used in simpler ways. OCLA is provided with a large number of signals 
and busses from many Ice9 sub-blocks. With these you can form simple or complex triggers, and select which 
groups of signals you wish to capture in Collector Blocks for later viewing. 

OCLA can also trigger on up-to 2 of the many events provided to SCB Performance Counters, and can combine 
those events with OCLA’s own events in an AND-OR-delay manner to form triggers. But it’s simpler and usually 
adequate to use OCLA’s own large selection of trigger signals. If you configure OCLA to use SCB Performance 
Counters events, this “ties up” the SCB Performance Counters mechanism, in that any counting done by SCB 
Performance Counters must be on those same events. Furthermore, you must manage your SCB writes and reads 
to avoid missing events you wished to trigger on. No such management of SCB accesses is needed if you use OCLA’s 
own trigger signals. 

The OCLA Spec is the “On Chip Logic Analyzer” chapter of the chip or hardware-system spec. This chapter 
can also be found under rev-control in <project>/specs/ice9/chipocla/chipocla.lyx 

OCLA is fairly difficult to program. Expect a learning-curve. Your first OCLA program will likely not work
at all. Example programs have been written and made to work in simulation for each of the Ice9 sub-blocks
containing OCLA, for many trigger signals, and for various styles-of-use of OCLA. When writing a new OCLA
program it’s recommended that you get it going in simulation first, then transfer it to the lab. Even experienced
OCLA-programmers often resort to simulation-waves to debug a non-working OCLA program. The lab, of course,
doesn’t have such visibility.

The OCLA wiki page http://apollo.sicortex.com/swiki/OclaVerification lists working example programs 
and where to find the code for them. You can use the Makefile there to create a Diagnostics “dash” perl script with 
the configuration of any of these programs. This perl script gives you the same OCLA configuration in the lab as 
was in the simulation test. These perl scripts are fairly readable and can be edited if you know OCLA well enough. 


9.6.1 OCLA Driving an External Pin 


OCLA can be configured to drive any one of hundreds of internal signals to the Ice9 external pin “sys_ocla_trig”. 

The signals to choose from are those leading into the OCLA Trigger Blocks, as described in the OCLA Spec. 

The occurrence of SCB Performance Counters events may also be driven out this pin. 

Logical combinations of signals and pattern-matching on busses can be combined to determine when to drive 
this pin. 

This can be useful to: (a) gain visibility inside the ASIC as to whether or how-often an internal event is 
happening, (b) trigger lab logic-analyzer equipment at the correct time to capture external busses data. 

To do this a small OCLA LAC program is required, as well as configuring one or more Trigger Blocks. 

There will be a fixed multi-clock delay of some 20 to 40 ns, depending on the signal, between activity on your 
selected signal and that same activity on “sys_ocla_trig”. 

Signals from the FSW unit, and events from SCB Performance Counters, will be distorted due to clock-domain 
crossings, and the need to stretch short pulses so they don’t disappear as they enter the cclk domain and pass 
through OCLA. Isolated high pulses will not be lost, but sometimes 2 closely-spaced pulses from FSW or SCB 
Performance Counters will merge into one pulse. 

The fastest oscillation of this output is at 1/2 cclk frequency. Quality of viewed waveform will depend on how 
well the signal is kept close to a ground signal as it passes from ASIC, through board, into scope probe, into scope. 
With a couple inches of distance along the way not twisted with ground you can still tell the difference between 
actual pulses driven high and ringing/reflections. 


9.6.2 OCLA as a Counter 


One simple use of OCLA is as a counter. 

Only a very simple LAC program is needed, but even so, it’s usually less configuration to feed the signals or 
triggers to SCB Performance Counters, and do the counting there. OCLA is more flexible, but SCB Performance 
Counters is pretty powerful. If you wish to count one trigger qualified by another, SCB Performance Counters 
can do that. If you wish to count one trigger qualified by a delayed or advanced version of another trigger, SCB 
Performance Counters can do that, with the delays being applied in OCLA LAC before the triggers are sent to 
SCB Performance Counters. 
SCB Performance Counters are 32 bits whereas OCLA counters are only 12 bits. Fortunately OCLA counters 
have sticky overflow bits indicating when more than 4095 counts occurred. 
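
As an illustration only, a 12-bit counter with a sticky overflow flag can be modelled in a few lines of Python. The class name, the `tick`/`reset` methods, and the choice to wrap to zero on overflow are assumptions made for this sketch; the OCLA Spec is the authority on the real counter behaviour.

```python
class StickyCounter12:
    """Toy model of a 12-bit event counter with a sticky overflow flag.

    Hypothetical helper, not part of any OCLA tooling. The wrap-to-zero
    behaviour is an assumption of this sketch.
    """
    MASK = 2**12 - 1  # 4095, the largest value a 12-bit counter can hold

    def __init__(self):
        self.count = 0
        self.overflow = False  # sticky: stays set until reset()

    def tick(self, event):
        """Advance one clock; count only when the event is asserted."""
        if not event:
            return
        if self.count == self.MASK:
            self.overflow = True  # the 4096th event no longer fits
        self.count = (self.count + 1) & self.MASK

    def reset(self):
        self.count = 0
        self.overflow = False


counter = StickyCounter12()
for _ in range(4095):
    counter.tick(True)
print(counter.count, counter.overflow)  # 4095 False
counter.tick(True)
print(counter.count, counter.overflow)  # 0 True
```

The flag tells software that the 12-bit value alone can no longer be trusted as a total.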

You can have a 24-bit counter by nesting OCLA’s two counters in OCLA LAC program loops, but you get a 
slightly imprecise count, because it’s not watching for the event every time you “carry” from the lower-bits counter 
to the upper-bits counter. 

One motivation to count in OCLA rather than SCB Performance Counters is that SCB Performance Counters 
has black-out periods (missing counts) whenever an SCB write or read is in progress. 

Another motivation to create a counter in OCLA is if SCB Performance Counters is already in use, or if you 
wanted more than 2 continuously-counting counters. 2 continuous full-count counters in SCB Performance Counters 
plus one in OCLA gives you 3 at once. 

2 in SCB Performance Counters plus 2 in OCLA gives you 4 at once, but OCLA cannot increment both of 
OCLA’s counters in the same clock, so you’d have to decide which count gets incremented and which doesn’t if 
both events happen at the same time. If the 2 events are known to not happen on the same clock, there’s no 
problem. If the events are sparse and unrelated you could just accept one of the counts being inaccurate. If the 
events would predictably occur on the same clock, you could delay one of them with LAC delay regs. 

The real power of counting with OCLA trigger blocks is configurability. You can “design your own counter”! 
OCLA can be configured to count an AND-OR combination of many signals, even delayed signals! OCLA can also 
be configured to only count when an address, state-encoding, or packet header information on a bus matches one or 
more values or address-ranges. Much of this can be fed through to SCB Performance Counters, with the counting 
done there, but the full AND-OR combination flexibility is only available by counting in OCLA. 


9.6.3 OCLA as a Times-of-Occurrence Recorder 


Using OCLA’s free-running-counter to collect time-stamps in a collector block, you can get an “event timeline” 
of any OCLA-trigger event. As stated before, this event can be an AND-OR combination of signals or delayed 
signals, including pattern-matching on addresses, state-encodings, or packet headers. 

Up to 1024 event-timestamps can be collected. You lose any events after that. 

The free-running counter is 32 bits, so timestamps for up to 2**32 cclks (16 seconds) are unambiguous. An 
unambiguous time record for more than 16 seconds (with no upper limit in time), for 1024 or fewer occurrences of 
the event, can be had by writing some watching software for one of the processors that periodically reads some 
OCLA registers. 
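
One possible shape for such watching software is sketched below in Python. It assumes the watcher can read the current value of the free-running counter and the timestamps collected since its previous poll, and that polls are less than 2**32 cclks apart. How those values are actually read from OCLA registers is not shown; the class and method names are hypothetical.

```python
MASK32 = 2**32 - 1  # the free-running counter is 32 bits wide


class TimestampExtender:
    """Extend 32-bit OCLA timestamps to unbounded integers.

    Sketch only: assumes successive polls are less than 2**32 cclks apart,
    so each new event is at most one wrap behind the value read at poll time.
    """

    def __init__(self, first_now32):
        self.now64 = first_now32 & MASK32

    def poll(self, now32, new_timestamps):
        """now32: counter value read at this poll.
        new_timestamps: raw 32-bit timestamps captured since the last poll."""
        prev32 = self.now64 & MASK32
        # Advance the wide clock by the elapsed cclks, modulo one wrap.
        self.now64 += (now32 - prev32) & MASK32
        # Place each event the right distance behind "now".
        return [self.now64 - ((now32 - t) & MASK32) for t in new_timestamps]


ext = TimestampExtender(0xFFFFFF00)
# The counter wrapped between polls; one event before the wrap, one after.
print([hex(t) for t in ext.poll(0x100, [0xFFFFFFF0, 0x10])])
# ['0xfffffff0', '0x100000010']
```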

No useful logic-analyzer-type collection of values usually occurs when using OCLA in this manner; all 
you get is a series of timestamps. 


9.6.4 OCLA as a Logic Analyzer 


The full use of OCLA’s capabilities is to collect values of many signals and busses in response to a simple or 
complex trigger event. The OCLA Spec is where this is described in detail, but here are a few highlights: 

Up to 1024 cclks of activity may be collected. 

This collection can be done for one period of time of 1024 cclks, or multiple smaller time periods can be collected. 

A “qualification” feature allows some data (but not all) to be collected “only when valid”, which is very efficient. 
This allows many short events, or single-clock events over a long stretch of time to fit in a 1024 entry collector 
block. 

A period of collection can be programmed to be mostly prior to the trigger, centered around the trigger, or after 
the trigger. 

Activity can be collected in more than one sub-block of Ice9 at the same time. 

Although there are many choices of sets of data to collect, they represent only a small fraction of the signals 
and busses in the sub-blocks of Ice9. We did the best we could to choose “likely to be useful” data to wire-up to 
OCLA, but in hindsight we already see we could have made some better choices. Hopefully what you need will be 
there. 
Chapter 10 


Serial Configuration Bus 


[$Id: chipSCB.lyx 50693 2008-02-07 16:01:46Z wsnyder $] 


10.1 Overview 


The Serial Configuration Bus (SCB) is a small serial bus used to interconnect software programmable registers 
(aka CSRs or slow I/O registers) throughout the chip. 

The SCB is controlled by the SCB Master block. The SCB Master (SCBM) interprets CPU reads and writes, 
and converts them to a serial bus. The serial bus is driven to the first in a ring of SCB Slaves (SCBS). It eventually 
reaches the desired slave, which performs the read or write and drives the data further along the ring and finally 
back to the SCB Master. 

The SCB also implements performance counters, which statistically sample monitoring points across the design. 


10.2 Specifications 


• Up to 128 slaves. 
• 32-bit data. 
• Up to 24 bits of configuration register address space per slave. 
• Low-cost 3-signal interconnect. 
• SysChain interface for module processor access to all slave registers. 
• Synchronous clocking in each required clock domain. 
• Standardized Slave interface, for easy instantiation. 
• Low cost reset of all I/O register state. 
• Performance sampling interface, with up to 256 different events per slave. 
• All of OCLA events, plus enum AllEvent available for performance counting. 
• Any of the performance events visible at OCLA, for logic analyzer triggering. 
• System Manager interface registers for LEDs, Attention, chip number, etc. 


10.3 Differences, Bugs, and Enhancements 


10.3.1 Product and Chip Pass Differences 
1. ICE9B returns a different product (ICE9B) and/or revision (ICE9A1 vs ICE9A0) when reading R_ScbChipRev. 


2. ICE9B has reduced latency accessing the SCB’s own registers. 
3. ICE9B adds an interrupt/attention for when the Chip<->Msp channel is ready for transmit. 
4. ICE9B adds R_ScbDInt to replace the SysChain R_SysTapDint register, see bug2223. 

5. TWC9A returns a different product (TWC9A) and/or revision when reading R_ScbChipRev. 
6. NEED IMPL: TWC9A supports 64 bit SCB slaves and 64 bit registers, see bug4619. 


7. TWC9A adds R_ScbDInt_SendDInt6, R_ScbDInt_Cpu6DM, R_ScbAtnInt_Cpu6DM Mask, and R_ScbAtnInt_Cpu6DM 
to support CPUs 6-9. 


8. TWC9A fixes reads to fast DDR clock registers returning the wrong results after a CCLK register read, 
bug4331. Earlier chips required a dummy read between such read sequences. 


9. TWC9A will skip sampling bucket pairs where R_ScbPerfBuckets_Event == AllEvent_INVALID. This is 
backward compatible with other products, which should use that encoding for invalid buckets. bug4265. 


10.3.2 Known Bugs and Possible Enhancements 


1. In ICE9A and ICE9B, all SCB accesses must be done with 32-bit accesses. Using a 64-bit read/write to 
access them will put return/write data in the wrong half of the quadword, not simply return or write half of 
the data. 


2. Decouple the SCB CPU#_P/[01] events from the CPU performance counter domain (U/S/K), perhaps with 
new domain bits. 


3. SCB performance counts from Ocla TrbC blocks depend on the TrbC configuration; this could be simplified. 
bug1717. 


4. R_ScbPerfEna should have a way to stop immediately, without corrupting, for interrupt handlers. Perhaps 
add a Pause bit that stops on current bucket and partial interval. We’ll also need to make the partial interval 
programmable so context switches can reprogram it. 


5. R_ScbPerf* registers should be writable without needing to stop sampling. 


6. R_ScbInt should indicate what bucket(s) have caused the overflow, to save software from having to read the 
entire count ram on each overflow, bug2164. 


7. R_SysTapMsp transactions should be double buffered, as the Msp decision loop is quite slow. 


8. R_ScbInt, like most of the other blocks in the chip, contains the interrupt state before masking. This requires 
the interrupt handler to read (or cache) R_ScbIntMask before dispatching interrupts. 


10.4 Block Diagram 
10.5 SCB Master Ports 


Signal Name | Dir | To/From | Description 
pmi_scb_req_cr | In | Pmi | Scb request pulse. Pulsed to request a Scb transaction; _wr, _addr, and _wdata are valid until acknowledged. 
pmi_scb_addr_cr | In | Pmi | Scb request read/write address. 
pmi_scb_wr_cr | In | Pmi | Scb request write, not read. 
pmi_scb_wdata_cr | In | Pmi | Scb request write data. 
scb_pmi_ack_cr | Out | Pmi | Scb return acknowledge. Pulsed to indicate completion of transaction, and _rdata is valid. 
scb_pmi_rdata_cr | Out | Pmi | Scb return read data. 
scb_csw_ScbInt_ca | Out | Pmi | Scb interrupt. Asserted while interrupt requested. 
scb_chaino_dat_*r | Out | first SCB Slave | Serial SCB chain output (one per clock domain). 
 | In | last SCB Slave | Serial SCB chain input (one per clock domain). 
Figure 10.1: Scb Overview 


10.6 SCB Slave Ports 


The SCB Slave is a standard Verilog/SystemC module that is instantiated by blocks to decode the serial bus 
into connections for the local block’s register logic. 
Signal Name | Dir | To/From | Description 
chaini_scbs_dat_r[2:0] | In | previous SCB Slave | Serial SCB chain input. 
scbs_chaino_dat_r[2:0] | Out | next SCB Slave | Serial SCB chain output. 
scbs_x_active_r | Out | slave user | Transaction active. May be used as a clock gate for slave logic that only needs to be active during SCB activity. Asserted starting with the initial req_r assertion through acknowledgement, and during event counting. 
scbs_x_req_r | Out | slave user | Read/write request pulse. Indicates address and data are stable. 
 | Out | slave user | Decoded address, for register accesses. 
scbs_x_wr_r | Out | slave user | Write/not read. Asserted for writes, deasserted for reads. 
scbs_x_wdata_1r[31:0] | Out | slave user | Write Data. 32-bit data bus for writing. 
x_scbs_ack_r | In | slave user | Read/write acknowledge pulse. Pulsed to acknowledge write, or read data is ready. 
 | In | slave user | Read data. 
x_scbs_id[6:0] | In | slave user | Identity. Specifies constant 7 upper address bits that must match address to accept SCB transaction. See 16.6.6. 
scbs_x_counting_r[1:0] | Out | user events | Asserted when the events are being counted by the SCBM. May be used to gate latching of last-event addresses, etc. 
scbs_x_eventId0_r[7:0] | Out | user events | Event number to route to x_scbs_event[0]. 
scbs_x_eventId1_r[7:0] | Out | user events | Event number to route to x_scbs_event[1]. 
x_scbs_event[0] | In | user events | Count bit A. Level asserted to count. 
x_scbs_event[1] | In | user events | Count bit B. Level asserted to count. 


10.7 Custom/Large Structures 
Name | Size | Description 
ScbCntRam | 256x50 1rw | Counting RAM, size based on number of sampling points, so easily negotiable. 


10.8 I/O Operations 


The SCB master connects to the system via the PCI Host interface, which receives I/O read and write 
transactions from the CPUs. When the SCB master detects an I/O write to its 32-bit address space, it initiates 
an SCB I/O write operation on all of the SCB busses. 

The address and data are shifted onto the SCB buses. One of the 128 SCB slaves decodes the address, and 
asserts a request to the SCB slave user’s logic. The user logic writes the register and asserts a strobe back to the 
SCB slave. The slave returns the acknowledge back over the SCB bus to the SCB master. 

On a read, the address is shifted onto the SCB bus, and is decoded by one of the SCB slaves, which asserts a read 
request to the SCB user logic. The user logic reads the registers and returns the read data and acknowledgement 
to the SCB slave. The SCB slave shifts out the acknowledgment and data back to the SCB master, which returns it 
to the system bus. 
10.8.1 No responder 


When an SCB slave sees a transaction to its address space, it asserts Cmd[0] back to the SCB master. Should no 
slaves respond in this way across all of the domains, the SCB master will acknowledge the transaction itself (since 
no slave will ever respond.) On writes, this means the write will be silently dropped. On reads, the return data 
will be zero. 


10.8.2 Approximate Latency 


The approximate read latency of SCB operations is calculated below. Currently the sclk is both the slowest 
chain and the chain with the most loads (8). This yields a minimal latency estimate of 210 ns. 


Who | Latency | Description 
Fsw | ~2-5 cclks | Latency across Fsw and Cac. 
Scbm | | Overhead. 
Scb | 20 Xclks + #slaves | Time to send and return the command. This is the maximum across all clock domains. 
Slave | 3+ Xclks | Time for slave to respond to request for data. 
Fsw | ~2-5 cclks | Latency across Fsw and Cac. (Due to bus-stop organization, this is likely to be smaller if the above Fsw latency is large, and vice-versa.) 
Total | | Read return latency. 


10.8.3 Software Notes 


The SCB registers must be accessed with 32-bit load/store operations. Other size operations are not supported. 


10.9 SysChain Interface 


The registers on the SCB bus may be accessed over the SysChain interface. This may be done at any time; it 
is round-robin arbitrated with the normal Pmi path. 


10.9.1 SysChain Access Requirements 


To access SCB registers via the SysChain bus: 
1. SCBM/BBS reset must be deasserted. SCB slaves may still be in reset. 
2. All clocks with SCB chains must be running, not just the cclk and destination slave clock. 


3. Software must ensure that one SysChain write/read completes with “done” before the next is launched, or 
must request a reset between transactions. 


4. An old transaction may be shifted out simultaneously with a new command shifting in. 


10.9.2 SysChain SCB Write 


To write a register on the SCB chain, the address and data are prepared in the R_SysTapScb structure. The 
write and go bits are set, and the structure is shifted into the SCB SysChain interface. The SCB will decode the 
command and see the go bit set. It then performs the IO write as described above. On completion, the command 
register may be shifted out; the go bit will now be clear, and the done bit will indicate if the write was completed. 
Figure 10.2: Scb Performance Counting 


10.9.3 SysChain SCB Read 


To read a register on the SCB chain, the address is prepared in the R_SysTapScb structure. The write bit is 
cleared, the go bit is set, and the structure is shifted into the SCB SysChain interface. The SCB will decode the 
command and see the go bit set. It then performs the IO read as described above. On completion, the command 
register may be shifted out; the go bit will now be clear, and the done bit will indicate if the read was completed; 
if so the data field contains the read data. 
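
The write and read sequences above can be sketched as a small Python model. Everything here is illustrative: the field names in `SysTapScb`, the `FakeScb` stand-in, and the helper functions are assumptions, and the real R_SysTapScb layout is defined in the register specification. The model completes each command instantly, whereas real software would keep shifting the command register out until the go bit reads clear.

```python
from dataclasses import dataclass


@dataclass
class SysTapScb:
    """Hypothetical software view of R_SysTapScb (field names assumed)."""
    addr: int = 0
    data: int = 0
    write: bool = False
    go: bool = False
    done: bool = False


class FakeScb:
    """Stand-in for the SCB SysChain interface, backed by a dict."""

    def __init__(self):
        self.regs = {}
        self.cmd = SysTapScb()

    def shift(self, new_cmd):
        """Shift a new command in while the previous one shifts out."""
        old, self.cmd = self.cmd, new_cmd
        if new_cmd.go:
            if new_cmd.write:
                self.regs[new_cmd.addr] = new_cmd.data & 0xFFFFFFFF
            else:
                # Unmapped addresses read as zero in this model.
                new_cmd.data = self.regs.get(new_cmd.addr, 0)
            new_cmd.go, new_cmd.done = False, True
        return old


def syschain_write(scb, addr, data):
    scb.shift(SysTapScb(addr=addr, data=data, write=True, go=True))
    result = scb.shift(SysTapScb())  # shift the finished command back out
    return result.done and not result.go


def syschain_read(scb, addr):
    scb.shift(SysTapScb(addr=addr, write=False, go=True))
    result = scb.shift(SysTapScb())
    return result.data if result.done else None


scb = FakeScb()
print(syschain_write(scb, 0x40, 0x1234))  # True
print(hex(syschain_read(scb, 0x40)))      # 0x1234
```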


10.10 Performance Counting 


When not being used for an I/O operation, the SCB bus may be used for counting events and performance 
monitoring. 


10.10.1 True Counting 


SCB Performance Counting can provide you a full count of how many times up to two events happened. You 
configure buckets 0 and 1 only, and don’t enable incrementing to the next pair of buckets. Even if the SCB slaves 
selected are in a different clock domain from the SCB master, an accurate count of events at the SCB slave will be 
tallied. The only events you miss are those that occur during an SCB bus I/O operation, so you should be careful 
to manage SCB bus use during accurate counting. 


10.10.2 Statistical Counting 


Up to 256 events can be counted in a statistical manner, watching for each for an equal amount of time. 

When enabled by R_ScbPerfCtl_Run, the SCB starts with bucket 0. The R_ScbPerfBuckets[0] register is loaded, 
which directs the SCB to select a given event number to sample into that bucket, see 10.17.9. In TWC9A+, if the 
event number is INVALID, the SCB skips the rest of this description and moves on to the next bucket. 

The event number is shifted to all of the SCB slaves. The slave corresponding to that event then routes that 
event’s state to the data wires, which propagates back to the SCB master. The SCB master increments a counter 
each cycle the data wire is true, thus calculating the number of cycles the event was true. 
To allow for better debugging and tracking of cross products, the SCB can determine how long a signal was 
asserted on two such events at once, one on each of the two serial data wires. While R_ScbPerfBucket[n] is being 
counted, the event in R_ScbPerfBucket[n+1] is simultaneously being counted. 

After a programmed delay in R_ScbPerfCtlInterval, the SCB adds the event counter to the total in the 
R_ScbPerfBucket_Count[0] (and [1]) register, see 10.17.10. It then increments the bucket number by two and 
begins the process again with the event in R_ScbPerfBucket_Count[2] (and [3]). 
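
The bucket rotation described above can be sketched as a behavioural model in Python. This is not the SCB implementation: `run_sampling` and its parameters are hypothetical, event levels are supplied by a caller-provided function, and the overhead of switching between bucket pairs is ignored.

```python
def run_sampling(bucket_events, event_state, interval, rounds=1):
    """Accumulate per-bucket counts by dwelling on two buckets at a time.

    bucket_events: event id programmed into each bucket (even length)
    event_state:   function (event_id, cycle) -> bool, the event level
    interval:      cycles to dwell on each bucket pair
    rounds:        number of full passes over all buckets
    """
    counts = [0] * len(bucket_events)
    cycle = 0
    for _ in range(rounds):
        for n in range(0, len(bucket_events), 2):
            pair = bucket_events[n:n + 2]
            for _ in range(interval):
                # Both events of the pair are watched in the same cycle,
                # one per serial data wire.
                for k, event_id in enumerate(pair):
                    if event_state(event_id, cycle):
                        counts[n + k] += 1
                cycle += 1
    return counts


def demo_state(event_id, cycle):
    # "always" is true every cycle, "half" only on even cycles.
    return True if event_id == "always" else cycle % 2 == 0


print(run_sampling(["always", "half", "half", "always"], demo_state, 1000))
# [1000, 500, 500, 1000]
```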

In this way, over time, the SCB has a statistical average of how often each event occurs. To reduce sampling 
errors on events which are asserted for long times, 1K cycles seems a reasonable minimum sample interval per 
bucket. At this interval we can go through all buckets at 250 MHz * 2 events at once / 256 buckets / 1K cyc/event 
≈ 1,900 samples per second. (This ignores the minor overhead in switching between events, so the real figure is 
~4% smaller.) 

Once you have a count of events at an SCB slave in a different clock domain from the SCB master, if you want 
to calculate the percentage of slave clocks when the event was true, you must factor-in the ratio of clock speeds 
between SCB master and slave. 
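
As a worked example of that correction, the Python sketch below converts a dwell time measured in SCB master clocks into slave clocks before computing the percentage. The function name and the example frequencies are assumptions chosen for illustration, and the count is taken to be in units of slave clocks.

```python
def percent_of_slave_clocks(count, master_cycles, master_hz, slave_hz):
    """Percentage of slave clock cycles during which the event was true.

    count:         slave clocks on which the event was counted
    master_cycles: dwell time, measured in SCB master clock cycles
    master_hz:     SCB master clock frequency
    slave_hz:      clock frequency of the slave's domain
    """
    # Same wall-clock time, expressed in slave clocks.
    slave_cycles = master_cycles * slave_hz / master_hz
    return 100.0 * count / slave_cycles


# A slave running at half the master clock sees 500 of its own clocks
# during a 1000-cycle master interval, so 250 counts is 50%.
print(percent_of_slave_clocks(250, 1000, 250e6, 125e6))  # 50.0
```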


10.10.3 Counts Causing Interrupts 


The software can configure interrupts when the event counters set a certain count bit number. For example, if 
R_ScbPerfCtl_IntBit==31, an interrupt will be raised exactly when an event causes its counter to count above 2^31. 
(Not while it is above 2^31, but when the event itself occurs.) Software then clears the interrupt. 

Note the interrupt for event x overflowing may be signaled before R_ScbPerfCounts[x] is written with the 
overflowing value. Software should poll R_ScbPerfBuckNum in the interrupt handler to see it increment once if it 
relies on R_ScbPerfCounts[x] to indicate what bucket(s) overflowed. 
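
The "when the event occurs, not while above" behaviour amounts to an edge detect on the selected count bit. A minimal Python sketch follows, under the assumption that the interrupt corresponds to that bit changing from 0 to 1; the function name is hypothetical.

```python
def raises_interrupt(old_count, new_count, int_bit):
    """True only on the update where the selected count bit becomes set."""
    bit = 2**int_bit
    return (old_count & bit) == 0 and (new_count & bit) != 0


print(raises_interrupt(2**31 - 1, 2**31, 31))  # True: the crossing itself
print(raises_interrupt(2**31, 2**31 + 1, 31))  # False: already above
```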


10.10.4 OCLA Triggering 


From SCB Performance Counters to OCLA: 

Both of the final count wires, as seen by the SCB master, are routed to the OCLA. These two signals add to 
the large collection of things OCLA already has to trigger on. These provide OCLA the ability to trigger on any 
of the events SCB Performance Counters can count. But, in order to do so, SCB Performance Counters must be 
configured to dwell continuously on the one or two events that OCLA wants to see. 


10.10.5 Events from OCLA 


From OCLA to SCB Performance Counters: 
All of OCLA’s TRBC or TRBV triggers, and the raw signals from TRBVs, are available to SCB Performance 
Counters as events to be triggered on. These events are in addition to those listed in enum Ice9_AllEvent. 


10.10.6 Arbitration 


SCB I/O operations and event counting require the same SCB slave data wires. 

To avoid conflict, when an SCB I/O operation occurs, the current event count will be suspended, the SCB I/O 
operation performed, and the same event count restarted from where the count ended. In the end, the event will 
have been sampled for the same number of cycles as if it had never been interrupted. The interruption may cause 
minor inaccuracies in the counting, but should be negligible given how infrequently SCB accesses will occur. 


10.10.7 Software Notes 


Each event is loaded into a 32 bit count register. To prevent overflow, these counters must be sampled at least 
every 4G/500 MHz = 4 seconds. (It is more typically 10,000 seconds, as in normal operation each event is only 
sampled for 1/2048th of the time, but the SCB may be programmed to count only a single event forever.) Software 
should sample significantly faster than this (once per second), and derive the rollover bits to present a 64-bit counter 
to the upper level application. 
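
One way software might derive the rollover is to accumulate the difference between successive 32-bit readings, modulo 2**32. The Python sketch below assumes the counter advances by less than 2**32 between samples; `Counter64` is a hypothetical helper, and how the raw value is read from the hardware is left out.

```python
MASK32 = 0xFFFFFFFF


class Counter64:
    """Present a wrapping 32-bit hardware count as a wide software total."""

    def __init__(self, first_raw=0):
        self.last_raw = first_raw & MASK32
        self.total = self.last_raw

    def update(self, raw):
        """Feed the latest 32-bit reading; returns the extended total."""
        raw &= MASK32
        # The modulo difference is correct across at most one rollover.
        self.total += (raw - self.last_raw) & MASK32
        self.last_raw = raw
        return self.total


ctr = Counter64(0xFFFFFFF0)
print(hex(ctr.update(0x10)))  # 0x100000010, the count rolled over once
```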

The best presentation to the user is probably as string-indexed values. The strings will be automatically 
extracted from the enum declarations in the specifications by the vregs package. 

All performance registers are in a unique 64KB page to allow software to map only the performance counter 
physical page into user visible virtual address space. 
10.10.8 Writing while Counting 


Generally software should stop the counters before writing them. If, however, the counters are running, the 
table below describes the potential hazards. Note writing the same value never has an effect; the table only applies 
when the value to that field will change. 


Register | Effect of writing while counting 
 | Takes effect immediately. No hazards. 
 | Takes effect immediately. No hazards. 
 | Takes effect at the end of the current interval. 
 | Takes effect at the beginning of the next interval. 
R_ScbPerfCtlInterval | When a count is in progress, changing the interval may make the counter overflow. Not recommended. 
R_ScbPerfHist_HistGte | Takes effect immediately. If the bucket being sampled is using histogram, the count currently being calculated may spuriously count or lose a few events. Not recommended. 
R_ScbPerfBuckNum_Bucket | If R_ScbPerfCtlnoInc is set, the written value will be used when the next interval begins. If R_ScbPerfCtlnoInc is clear, the written value, or 2 plus the written value, may be used when the next interval begins. 
R_ScbPerfEna_Ena | Writing a one has no effect, as counting is already running. Writing a zero requests disabling counting when the next complete round of sampling completes. 
R_ScbPerfStat_Run | Read-only. No hazards. 
R_ScbPerfBuckets_Event | Takes effect the next time the specific bucket starts or resumes counting. 
R_ScbPerfBuckets_IfOther | Takes effect immediately. If this bucket is the one being counted, the count currently being calculated may spuriously count or lose a few events. Not recommended. 
R_ScbPerfBuckets_Hist | Takes effect immediately. If this bucket is the one being counted, the count currently being calculated may spuriously count or lose a few events. Not recommended. 
R_ScbPerfCounts_Count | If this bucket is not the one being counted, the value will remain. If this bucket is the one being counted, the new count may be used, or the value may be overwritten with the pre-written value plus the count from the current interval. 


10.11 Connecting to SCBS 


10.11.1 List of Slaves 


The ICE9 has slaves across most of the chip. A complete list of slaves is given in the AddrSublId enumeration 
in 16.6.6. Any row with a clock specified in the Clk column includes a Scb slave. 


10.11.2 Slave I/O Transactions 


Slaves connect their I/O registers to the SCBS using a simple request/acknowledge interface, with only one 
transaction ever outstanding. On a single cycle pulse of the scbs_x_req_r line, user logic decodes the address, 
write-not-read signal, and write data if applicable. When the user logic has completed the operation, it drives read 
data if applicable and pulses the x_scbs_ack_r line. The scb_ack_r must be pulsed after every scb_req_r, even if the 
address does not correspond to any valid register address. Additionally, invalid read addresses should return 0. 
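
The two rules above (always acknowledge, and return 0 for invalid reads) can be captured in a short behavioural model. The Python class below is a sketch for reasoning about the contract, not RTL; the class and method names are hypothetical, and one `request` call stands for one req/ack handshake.

```python
class SlaveUserLogic:
    """Behavioural model of the user-logic side of an SCB slave."""

    def __init__(self, valid_addrs):
        self.regs = {addr: 0 for addr in valid_addrs}

    def request(self, addr, wr, wdata=0):
        """Handle one request pulse. Returns (ack, rdata).

        ack is always True: every request must be acknowledged, even
        when the address matches no register.
        """
        rdata = 0  # invalid read addresses return 0
        if addr in self.regs:
            if wr:
                self.regs[addr] = wdata & 0xFFFFFFFF
            else:
                rdata = self.regs[addr]
        return True, rdata


slave = SlaveUserLogic([0x0, 0x4])
print(slave.request(0x4, True, 0xABCD))  # (True, 0)
print(slave.request(0x4, False))         # (True, 43981)
print(slave.request(0x100, False))       # (True, 0)
```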


10.11.3 Slave Performance Counting Interface 


Each slave uses the scbs_x_eventId#_xr signals to select which event is to be counted. The event is returned to 
the Scb slave counter as a single bit. The lowest cost way for user logic to implement this is probably a combination 
of muxes and AND gates which return a 0 whenever the address doesn’t match the desired event. A tree of these 
in each sub-block then feeds a reduction OR tree, or see Algorithm 10.1. Up to 8 flops may be introduced by the 
user logic at any point in this computation, as the SCB will discard the earliest sampling cycles. 

Figure 10.3: SCB Slave Timing 

If any slave has additional registers related to performance counting, those registers should be in a unique 64KB 
page to allow software to map it into user virtual address space. 


Algorithm 10.1 SCB User Event Counting Example 
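always @ (posedge clk) begin 
    if (scbs_cpu_active_pr) begin // Clock gate: only update during SCB activity 
        // Bit 1 returns event 1, bit 0 returns event 0 
        m_scbs_event_p <= {eventMux(scbs_cpu_eventId1_pr), 
                           eventMux(scbs_cpu_eventId0_pr)}; 
    end 
end 

// Returns the level of the selected event, or 0 for events this block does not own 
function eventMux; 
    input [7:0] select; 
    case (select) 
        `E_XxxScbEvent_CYCLES: eventMux = 1'b1; // always true, so it counts cycles 
        `E_XxxScbEvent_DCHIT:  eventMux = (signal_high_on_DC_hit); 
        default:               eventMux = 1'b0; 
    endcase 
endfunction 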


10.12 SCB Internals 


This section describes the SCB internals. 


10.12.1 PMI Interface 


The connection between the PMI and the SCB master is a simple pulsed request / acknowledge handshake. The 
request and acknowledge handshake is nearly identical to the slave interface, with the addition of the upper address 
bits. See Figure 10.3. 


10.12.2 SCB Bus Protocol 


All slaves on a particular SCB bus operate on the same clock domain; additional chains are used for each 
unique clock domain. Thus there are multiple SCB chains on the chip; presently for the pclk, cclk, d0clk, d1clk, 
and sclk. (We can never have an iclk chain, as the iclk is not running in non-PCI connected chips.) The master 
contains the synchronizer flops between the cclk (Scb master’s domain) and the slave bus’s domain. This places all 
of the synchronizers in one place, and is more logic efficient than requiring each slave to have a synchronizer. 


Each SCB bus consists of 3 wires connected in a chain, plus the clock. The 3 bits of the data bus consist of two 
logically separate signals, the command and data bits, that are bussed simply so the top level interconnect need 
only concern itself with a single bus. 


Signal Name | Description 
scb*_chain*_dat[1:0] | Data bits. 
scb*_chain*_dat[2] | Command bit. 


10.12.3 ICE9 Bit Sequence 


Every clock cycle, data is present on the _dat wires. A shift sequence begins with a start bit on _dat[2], 
and proceeds from MSB to LSB. The _dat[2] input feeds a 17 bit command shift register. Likewise, _dat[0] feeds 
the even bits of a 34 bit data register, and _dat[1] feeds the odd bits of the 34 bit data register. The bits of the 
command and data registers are allocated as follows: 


Register | Valid during what commands? | Definition 
Cmd[16] | | Start bit. 
Cmd[15:12] | | Command (see ScbCmd encoding.) 
Cmd[11:2] | | Address [11:2]. 
Cmd[0] | All | Match bit. Set by slave when command detected with an address matching a slave’s address. 
Data[33:32] | All | Indicates what acknowledgements are present on the data bus. See ScbDataAck encodings. 
Data[30:2] | AddrH | Address [30:2]. 
Data[30:24] | | Slave number for Event ID 1. 
Data[23:16] | | Event ID 1. 
Data[7:0] | | Event ID 0. 
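
To make the even/odd interleave concrete, the Python sketch below splits a data word into per-cycle bit pairs, MSB first, with _dat[1] carrying the odd-numbered bits and _dat[0] the even-numbered bits, and then reassembles it. The function names are hypothetical and the start-bit framing on _dat[2] is not modelled. Under these assumptions a 34 bit data register shifts in 17 cycles, the same length as the 17 bit command register.

```python
def serialize(value, width=34):
    """Split `value` into per-cycle (dat1, dat0) pairs, MSB first.

    dat1 carries odd-numbered bits, dat0 carries even-numbered bits.
    """
    if width % 2 != 0:
        raise ValueError("width must be even: two bits shift per cycle")
    pairs = []
    for hi in range(width - 1, 0, -2):  # hi = width-1, width-3, ..., 1
        dat1 = (value >> hi) & 1        # odd bit of this cycle
        dat0 = (value >> (hi - 1)) & 1  # even bit of this cycle
        pairs.append((dat1, dat0))
    return pairs


def deserialize(pairs):
    """Rebuild the word from the received (dat1, dat0) pairs."""
    value = 0
    for dat1, dat0 in pairs:
        value = value * 4 + dat1 * 2 + dat0  # shift left by two, insert pair
    return value


word = 0x2DEADBEEF  # a 34-bit example value
pairs = serialize(word)
print(len(pairs))                 # 17
print(deserialize(pairs) == word)  # True
```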


10.12.4 TWC9A+ Bit Sequence 


TWC9A changes the protocol slightly to allow 64 bit slaves. It also still supports 32 bit slaves without forcing 
them to implement a full 64 bit shifter, by ensuring the first 32 shifts (with bits 64:33) can simply be dropped by 
32 bit slaves and still have everything work out. 


Every clock cycle, data is present on the _dat wires. A shift sequence begins with a start bit on _dat[2], 
and proceeds from MSB to LSB. The _dat[2] input feeds a 33 bit command shift register. Likewise, _dat[0] feeds 
the even bits of a 66 bit data register, and _dat[1] feeds the odd bits of the 66 bit data register. The bits of the 
command and data registers are allocated as follows: 
Register | Valid during what commands? | Definition 
Cmd[32] | | Start bit. 
Cmd[31:17] | | Reserved. Note 32 bit slaves shift past these bits and so cannot decode them. 
Cmd[16] | | Finished shift bit. 32-bit slaves complete command shifting with this bit in what would normally be the start bit. Therefore, the master sets this bit so the code may assert the shifting was properly completed. This bit may be stolen for other purposes if the assertions are removed. 
Cmd[15:12] | | Command (see ScbCmd encoding.) 
Cmd[11:2] | | Address [11:2]. 
Cmd[1] | Read, Write | Double-word access. If a 32 bit slave sees this 
Cmd[0] | All | Match bit. Set by a slave when command detected with an address matching the slave’s address. 
Data[65:64] | All | Indicates what acknowledgements are present on the data bus. Slaves do not decode these bits. See ScbDataAck encodings. 
Data[63:0] | Read, Write | Data. 32-bit accesses have the data replicated in the upper and lower word. 
Data[63:32] | AddrH | Reserved. Note 32 bit slaves shift past these bits and so cannot decode them. 
 | AddrH | Address [30:2]. 
Data[63:32] | Count | Reserved. Note 32 bit slaves shift past these bits and so cannot decode them. 

10.12.5 Commands 


The 4 bit command, enumerated in 10.14.3, decodes to the following operations: 


10.12.5.1 Idle 


The Idle operation is used during bus idle, and is ignored by all slaves. 


10.12.5.2 Reset 


The Reset op causes SCB slaves to clear their internal state, and is reached by continuously 
sending all ones on the _dat[0] input. Reset persists until a pure Idle (all zeros) is received. This allows slaves to 
be reset on a hang without losing register state. 


10.12.5.3 AddrH 


The AddrH op causes the last 32 bits of data shifted over _dat[1:0] to be loaded into the high bits of the address 
register. 


10.12.5.4 Write 


The Write op loads the low I/O address from the low 4 bits of the command shifter, and asserts a write request 
to the SCB user logic. When the SCB user logic accepts the write with ack_r, the slave passes the acknowledgement 
to the SCB master by shifting a single pulse onto the _dat[1] output. 


10.12.5.5 Read 


The Read op loads the low I/O address from the low 4 bits of the command shifter, and asserts a read request 
to the SCB user logic. When the SCB user logic has the read data ready, it returns it to the SCB slave with 
an acknowledge. The slave acknowledges the Read to the SCB Master with a start pulse on the _dat output of 
_dat[2:0]=3’b011, followed by 16 double-bits of read data. 
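
As an illustration of the read return format, the following C sketch reassembles a 32 bit word from the 16 double-bits that follow the start pulse. It assumes each sample holds _dat[1:0] with _dat[1] as the odd bit and _dat[0] as the even bit, sent MSB first as described for the data register. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if a sample of _dat[2:0] is the read acknowledge start pulse 3'b011. */
static int scb_is_read_start(unsigned dat)
{
    return (dat & 0x7u) == 0x3u;
}

/* pairs[i] holds _dat[1:0] as sampled on the i-th cycle after the start pulse. */
static uint32_t scb_assemble_read32(const uint8_t pairs[16])
{
    uint32_t word = 0;
    for (size_t i = 0; i < 16; i++) {
        assert(pairs[i] < 4u);
        /* _dat[1] becomes the odd bit and _dat[0] the even bit of each pair. */
        word = (word << 2) | pairs[i];
    }
    return word;
}
```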


10.12.5.6 Count 


The Count op causes the high bits of the address to be compared to the slave’s write data register, and if 
matching, event data to be muxed onto the _dat[1:0] outputs. The two events being counted may come from different 
slaves, so two slave numbers are sent along with the Count op; either one matching will drive the appropriate _dat 
lines. Counting is “sticky” in that after the state machine returns to idle, it continues counting until the next _dat[2] 
start bit. 


10.13 Chip Reset 


On chip reset, all SCB master registers (except RAM) are cleared and counting is disabled. Software needs to 
clear the RAM by writing zeros to it during boot. 

While an SCB user is driving the reset line into its SCB slave, that slave will ignore all SCB transactions, and 
will place its SCB bus in bypass mode. This allows each slave to have a different reset, and all other slaves 
not in reset to still be programmable via the SCB. However, any slave’s reset must be deasserted only while the 
SCB bus is idle, to avoid decoding the first command incorrectly. 

Also during SCB user reset, an SCB slave will drive zeros on the write data wires. This allows SCB slaves to OR 
their CSR write enables with reset, so they will load the data bus and thus the zeros on reset. (Registers which 
affect the pins still need async reset, however.) This is more space and power efficient than using (a)synchronous 
resets on every data bit of every control register. 
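
The reset-by-write-enable trick can be modeled in a few lines of C. This is a behavioral sketch under assumed names (scb_csr_inputs and scb_csr_next are hypothetical), not the actual RTL: it shows one clock of a register whose data bits have no dedicated reset.

```c
#include <stdint.h>

typedef struct {
    uint32_t wdata;  /* write data wires from the SCB slave */
    int      we;     /* CSR write enable decoded for this register */
    int      reset;  /* SCB user reset into the slave */
} scb_csr_inputs;

/* One clock of a CSR whose data bits have no reset of their own. */
static uint32_t scb_csr_next(uint32_t current, scb_csr_inputs in)
{
    /* During reset the slave drives zeros on the write data wires. */
    uint32_t bus = in.reset ? 0u : in.wdata;
    /* OR-ing reset into the write enable makes the register load those zeros. */
    int load = in.we || in.reset;
    return load ? bus : current;
}
```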


10.14 Registers and Definitions 


10.14.1 Package Attributes 
Package 


chip_scb_spec 


Attributes 


-public_rdwr_accessors 


10.14.2 Definitions 
Defines 


SCB 


32’d32 DATAWIDTH Data Bus Width. Default width of data bus in bits. 


32’d7 SLAVEBITS Bits of address for unit number. Number of upper address bits that 
correspond to choosing which SCB Slave will be addressed. 


32’d8 COUNTBITS Bits of counter events. Number of lower address bits used per-slave 
for counting events. 


32’d60 DELAY_NS_BC Speed register delay, best conditions. Nanoseconds. 
32’d95 DELAY_NS_TYP Speed register delay, typical conditions. Nanoseconds. 
32’d190 DELAY_NS_WC Speed register delay, worst conditions. Nanoseconds. 


10.14.3 Command Enumerations 


ScbCmd specifies the bit encodings for the commands encoded in the first bits sent over the serial bus. 


Enum 


ScbCmd 


4’b0000 IDLE 
4’b0001 ADDRH Latch High Address. 


[Pp |__| Reserved +d 
[Pb [| Reserved + 
Prp1000 [|__| Reserved __——*d 
PPprox |__| Reserved ___———*d 
PPro [|__| Reserved | 


10.14.4 Data Ack Enumerations 


ScbDataAck specifies the bit encodings for the high two data bits. In addition, for slave transactions the MSB 
is the start bit, so it must be set. 


Enum 


ScbDataAck 


Value | Name | Definition (if from Slave) | Definition (if from Master) 
2’b00 | NONE | NA - No start bit | AddrH - No acks needed 
2’b01 | NEED | Read data 64 bit ack | Need later acknowledgement from slave. 
2’b10 | WRITE | Write accepted | First bit of write passed around loop, or last bit of count passed through loop 
2’b11 | READ | Read data 32 bit ack | 
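
For software models of the bus, the acknowledge encodings can be captured as a C enum. The sketch below is illustrative only: the C identifiers are hypothetical, and READ is assumed to take the remaining encoding 2’b11, which is consistent with the 3’b011 start pulse described for the Read op.

```c
#include <stdint.h>

typedef enum {
    SCB_DATA_ACK_NONE  = 0x0,
    SCB_DATA_ACK_NEED  = 0x1,
    SCB_DATA_ACK_WRITE = 0x2,
    SCB_DATA_ACK_READ  = 0x3   /* assumed encoding, see the note above */
} scb_data_ack;

/* Number of read data bits that follow an acknowledgement sent by a slave. */
static unsigned scb_slave_ack_data_bits(scb_data_ack ack)
{
    switch (ack) {
    case SCB_DATA_ACK_NEED:  return 64; /* from a slave: read data, 64 bit ack */
    case SCB_DATA_ACK_READ:  return 32; /* from a slave: read data, 32 bit ack */
    case SCB_DATA_ACK_WRITE: return 0;  /* write accepted, no data follows */
    case SCB_DATA_ACK_NONE:  return 0;  /* no start bit */
    }
    return 0;
}
```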


10.14.5 SCB Performance Events 


The following SCB internal events are trackable by SCB statistical event counting. 


Enum 


ScbScbEvent 


Attributes 


-descfunc 


8’h00 CYCLES Core clock cycles. Always counts. 


8’h01 CYCLES_D2 Internal verification only. Repeats high for 2 cycles, then low for 2. 


8’h02 MAGIC0 Internal verification only. Counts cycles where R_ScbPerfCtlMagicEvent[0] is true. 


8’h03 MAGIC1 Internal verification only. Counts cycles where R_ScbPerfCtlMagicEvent[1] is true. 


Reserved. Returns zero. 


10.14.6 Chip Revision Register 
Register 
R_ScbChipRev 


Attributes 


-kernel 


Address 


0xE_0800_0000 


[Bit | 


_ 16 | Features pins Feature bit. Bits in this region will be allocated to indi- 
cate optional features or enhancements, as they are spec- 
ified. Overlaps allowed. Bit 16 is 1 in ICE9A1 so we can 
determine proper mask selection. 


Product pins AddrProduct | Chip Product/Revision. Revision number of the chip 
product, returns ICE9, ICE9B, etc; incremented for each 
new major product. Use AddrProduct enumeration for 
exact values, see 16.6.4 on page 846. 

7:0 Rev R pins Minor Chip Revision. Revision number of the chip, 
bumped for different silicon passes or minor releases. This 
is metal-mask programmable. 


10.14.7 Chip Number Register 
Register 
R_ScbChipNum 


Attributes 


-kernel 


Address 
0xE_0800_0008 


Sepp} $a oe ee 


pear 11 ford aa flee number. Reserved for future use; written by mod- 
cee eal service processor, and read by software. 


coal (MspSlotId) | Slot ID number. Intended to be written over SysChain by 
module service processor, and read by software. Identical 
to MSP GPIO slot ID enumeration. 

Node pins Chip number on board (0-26). Hardcoded value from 


10.14.8 Chip Null Subcomponent Register 


This register is used for simulation purposes only. In real hardware it always returns 0. 


Register 
R_ScbChipMissing 


Attributes 


-kernel 


Address 


0xE_0800_0010 


Definition 


31 | Cached | Cached. Used in C code to indicate register value has been cached. 


30:12 | Reserved. 


11 | Model has no Uart function. Verification use only, 0 on HW. 


10 | Scb | pins | Model has no Scb Master function. Verification use only, 0 on HW. 


Pre | R | pins | Model has no Pre function. Verification use only, 0 on HW. 
Ocla | R | pins | Model has no Ocla function. Verification use only, 0 on HW. 
7 | I2c | R | pins | Model has no I2c function. Verification use only, 0 on HW. 
ete} = eee 
HW. 


Model has no F! function. Verification use only, 0 on HW. 


R | pins | Model has no Dma function. Verification use only, 0 on HW. 


R | pins | Model has no Ddre/Ddro functions. Verification use only, 0 on HW. 


2 | Coh | R | pins | Model has no Cac or Coh functions. Verification use only, 0 on HW. 
1 | Cpu15 | R | pins | Model has no Cpu1-CpuN functions. Verification use only, 0 on HW. 
a ca ce a ca 

HW. 


10.14.9 Chip Speed Register 


R_ScbSpeed is used to determine the latency through a delay line to provide a very rough approximation of the 
speed of the part. Software hits the GO bit, then waits for the GO bit to clear. It then reads the Count value. 
This experiment must always be done in pairs: The first will measure one edge transition (say rising-to-falling), 
the second will measure the opposite transition. The numbers will differ by 15% or so. Both numbers should be 
reported. 
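
The paired measurement can be sketched as follows. This is a hedged illustration rather than production code: register access goes through hypothetical callbacks so the logic can be tried off-target, the Go bit is assumed to be bit 31, Count is assumed to occupy the bits below it, and the fake register with its two count values is entirely made up.

```c
#include <stdint.h>

#define SCB_SPEED_GO (1u << 31)   /* assumed position of the Go bit */

/* Hypothetical access hooks for R_ScbSpeed. */
typedef struct {
    uint32_t (*read)(void *ctx);
    void     (*write)(void *ctx, uint32_t value);
    void     *ctx;
} scb_speed_reg;

/* Hit Go, wait for Go to clear, then return the Count value. */
static uint32_t scb_speed_measure_once(const scb_speed_reg *r)
{
    uint32_t v;
    r->write(r->ctx, SCB_SPEED_GO);
    do {
        v = r->read(r->ctx);
    } while (v & SCB_SPEED_GO);
    return v & ~SCB_SPEED_GO;     /* assumption: Count sits below the Go bit */
}

/* The experiment is always done in pairs, one per edge transition. */
static void scb_speed_measure_pair(const scb_speed_reg *r, uint32_t out[2])
{
    out[0] = scb_speed_measure_once(r);
    out[1] = scb_speed_measure_once(r);
}

/* Off-target stand-in: Go clears on the third poll, alternating two made-up counts. */
typedef struct { uint32_t value; int polls_left; int runs; } fake_speed;

static void fake_write(void *ctx, uint32_t value)
{
    fake_speed *f = ctx;
    if (value & SCB_SPEED_GO) {
        f->value = SCB_SPEED_GO;
        f->polls_left = 3;
    }
}

static uint32_t fake_read(void *ctx)
{
    fake_speed *f = ctx;
    if ((f->value & SCB_SPEED_GO) && --f->polls_left == 0)
        f->value = (f->runs++ % 2 == 0) ? 100u : 115u;
    return f->value;
}
```

Real software would likely bound the polling loop with a timeout, and would report both numbers as the text requires.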


Register 
R_ScbSpeed 


Address 


0xE_0800_0020 


31 | Go | RW1CS | Go. When written with one, set GO bit and start counting. 
After the delay is calculated, the Go bit will clear and the new count will be visible. 


a 


Count | Delay line time. After Go completes, number of pclk cycles plus 2 taken to count a 
delay line of SCB_DELAY_NS_TYP ns. See the note about double measurements in the 
beginning of this section. 


10.14.10 General Purpose IO Register 
Register 


R_ScbGpio 


Address 


0xE_0800_0040 


Reserved 


| Reserved 
19:16 | inp R pins GPIO input data. This may not match the output data 
when oe is asserted if a stronger driver is present on the 
input pin. Bit 0 reads value on sys_gpio (spare) input pin. 
Bits 3:1 reserved for future use. 


11:8 | oe RW GPIO output enable. If bit 0 set, drive sys_gpio (spare) 
pin with _data value. If clear, tristate. Bits 3:1 reserved 
for future use. 


3:0 data RW GPIO output data. Bit 0 value is driven to sys_gpio 
(spare) pin if _oe is set. Bits 3:1 reserved for future use. 
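
A minimal C sketch of how software might form and interpret R_ScbGpio values for the spare pin, using the field positions listed above (inp at 19:16, oe at 11:8, data at 3:0, with bit 0 of each field belonging to sys_gpio). The helper names are hypothetical.

```c
#include <stdint.h>

#define SCB_GPIO_INP_SHIFT  16  /* inp, bits 19:16 */
#define SCB_GPIO_OE_SHIFT    8  /* oe, bits 11:8 */
#define SCB_GPIO_DATA_SHIFT  0  /* data, bits 3:0 */

/* Register value that drives the sys_gpio (spare) pin to the given level. */
static uint32_t scb_gpio_drive(int level)
{
    return (1u << SCB_GPIO_OE_SHIFT) | ((level ? 1u : 0u) << SCB_GPIO_DATA_SHIFT);
}

/* Register value that tristates the pin (oe bit 0 clear). */
static uint32_t scb_gpio_tristate(void)
{
    return 0u;
}

/* Level seen on the input pin, from a value read back from the register. */
static int scb_gpio_input(uint32_t reg)
{
    return (int)((reg >> SCB_GPIO_INP_SHIFT) & 1u);
}

/* True if the pin is driven but reads back differently, i.e. a stronger driver is present. */
static int scb_gpio_contention(uint32_t reg)
{
    int oe   = (int)((reg >> SCB_GPIO_OE_SHIFT) & 1u);
    int data = (int)((reg >> SCB_GPIO_DATA_SHIFT) & 1u);
    return oe && (scb_gpio_input(reg) != data);
}
```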


10.14.11 LED Register 
Register 


R_ScbLed 


Address 


0xE_0800_0048 


Ee 


RW LED status. If set, assert sys_led_l pin by enabling 
its open drain driver, pulling sys_led_l low. If not set, 
sys_led_l is hi-impedance. 


10.14.12 Attention Chip Register 


With the associated R_ScbAtnMsp register, the attention chip register provides a signaling interface between 
the Chip and MSP. 


R_ScbAtnChip forms a MSP to/from Chip communication channel in conjunction with the R_SysTapAtnMsp 
register in 12.6.15. 


To send data to the MSP, the chip implements the code in 10.2. 


Algorithm 10.2 R_ScbAtnChip algorithm 
send_something() { 
    // Wait until the MSP has taken any previously sent data 
    do { 
        rdata = read_of(R_ScbAtnChip); 
    } while (rdata & bit(SendVld)); 
    // Enqueue the new data and set SendVld in the same write 
    write_of(R_ScbAtnChip, bit(SendVld) | send_data); 
} 
receive_something() { 
    rdata = read_of(R_ScbAtnChip); 
    if (rdata & bit(RecvVld)) { 
        // Tell the MSP that RecvData was accepted 
        write_of(R_ScbAtnChip, bit(RecvTaken)); 
        // process data in rdata 
    } 
    // Else nothing to receive 
} 


// Better code could both send and receive data simultaneously. 


Register 
R_ScbAtnChip 


Attributes 


-kernel -writeonemixed 


Address 
0xE_0800_0060 


RW | ICE9B+ | Transmitter Empty Interrupt Enable. Indicates chip interrupt should be asserted if 
SendVld is clear and more data may be sent. If clear, no interrupt. Note transmitter empty is the 
idle-state condition, so this bit should never be left on once all data is sent. First implemented 
in ICE9B. 

RecvInt | Receiver Ready Interrupt Enable. Indicates chip interrupt should be asserted if _RecvVld 
is also asserted. If clear, no interrupt. 

28 | RecvTaken | W1C | Receive Data Taken. Write one to send to module processor indication that 
RecvData was accepted, and clear _RecvVld. 


RecvVld | Receive Data Valid. Valid flag from module processor, identical to R_ScbAtnMsp_SendVld. 
Indicates module processor data is valid to be read from RecvData. When data is accepted, chip 
writes _SendTaken. 

SendVld | RW1CS | Send Data Valid. Write one to set and indicate new send data for MSP. Cleared 
when MSP takes the data. 

25:0 | RecvData | Receive Data. Overlaps SendData. 
If RecvVld is set, returns the next data to be received from the MSP. Note this is different data 
than that written. 

25:0 | SendData | Send Data. Overlaps RecvData. 
If SendVld is simultaneously being written with a one, enqueues new send data for the MSP, and 
sets SendVld. 


The ScbAtnInt register is used by the MSP to select what should assert the attention signal. 
This register should only be written by the MSP. (It would be a SysChain register, but leaving it in SCB space 
saves a significant number of synchronizer flops, as it must reside on a clock which is always running.) 


Register 
R_ScbAtnInt 


Attributes 


-Product=ICE9B+ 


Address 
0xE_0800_0070 


Definition 


Attention Asserted. True if the sys_atn pin is asserted, i.e. if any request bit is asserted and 
the corresponding mask is asserted. 

Non-Communication Attention Asserted. True if anything other than _RxAtn or _TxAtn is asserting 
attention. This bit is duplicated in R_SysTapAtnMsp_NonComAtn to reduce polling in the MSP fast path. 

Reserved. 

CPU9:6 Debug Mode Mask. See _CpuDMMask. 

Transmit Empty Mask. This is a read only copy of R_SysTapAtnMsp_TxAtnMask; use that register to 
enable/disable transmit interrupts. 

Receive Ready Mask. Enables _AtnRx asserting attention. 

OCLA Debug Mode Mask. Enables _OclaDM asserting attention. 

CPU5:0 Debug Mode Mask (CpuDMMask). Enables corresponding _CpuDM asserting attention. Note bits 
for CPU6-9 are not contiguous, see the _Cpu6DMMask field. 

Reserved. 

CPU9:6 in Debug Mode. See _CpuDM. 

Transmit Empty. R_SysTapAtnMsp_SendVld is clear, indicating more data may be transmitted. 

Receiver Ready. R_SysTapAtnMsp_RecvVld is set, indicating data is ready to be received. 

OCLA Requesting Debug Mode. Asserted when the OCLA is requesting a Debug Interrupt; identical to 
R_ScbDInt_OclaDM. 

CPU5:0 in Debug Mode. Asserted when the corresponding CPU is in Debug Mode; identical to 
R_ScbDInt_CpuDM. Note bits for CPU6-9 are not contiguous, see the _Cpu6DM field. 


10.16 Debug Interrupt Register 


Register 
R_ScbDInt 


Attributes 


-Product=ICE9B+ -noregtestcpu 


Address 


0xE_0800_0078 

ICE9B+ | Reserved. 
27:24 | R | x | TWC9A+ | CPU9:6 in Debug Mode. See _CpuDM. 
ICE9B+ | Reserved. 
eta f—fr freee och ee Or a 


OclaToAll | OCLA causes CPU Debug Interrupt. If set, when _OclaDM asserts, assert DINT to all CPUs. 


CpuToAll | CPU Debug Mode causes CPU Debug Interrupt. If set, when any CPU enters debug mode 
and _CpuDM asserts, assert DINT to all CPUs. Thus when one CPU takes a debug exception, they all will. 


SendDInt | RW | ICE9B+ | Send CPU5:0 a Debug Interrupt. Set high to assert DINT to the specified 
CPU. (Note DINT is edge sensitive at the CPU.) After setting, poll on this register until the 
corresponding _CpuDM bit asserts, then clear this bit. Note CPUs 6-9 are not contiguous. 


ICE9B+ | Reserved. 


OCLA Requesting Debug Mode. Asserted when the OCLA is requesting a Debug Interrupt. 


CPU5:0 in Debug Mode. Asserted when the corresponding CPU is in Debug Mode. Note CPUs 6-9 are 
not contiguous. 


10.17 Performance Counting Registers 


10.17.1 Interrupt Register 
Register 


R_ScbInt 


Attributes 


-kernel 


Address 


0xE_0800_0080 


31 | Interrupt asserted. Asserted to represent the interrupt output, namely whenever the given 
interrupt bit is on in this register, and the interrupt mask is enabled for that bit. 


a 


Attention Transmit Empty Interrupt. More data may be sent to R_ScbAtnChip. Send data to clear 
the interrupt. Note this bit resets to 1, as after reset the send buffer is empty and ready to 
transmit. 


Attention Interrupt. Data is ready in R_ScbAtnChip. Accept the data to clear the interrupt. 


Performance Interrupt. A counter has overflowed R_ScbPerfCtlIntBit. Write 1 to clear. 
R_ScbIntReq_PerfInt can be written to assert this interrupt. 


10.17.2 Interrupt Mask Register 


Register 


R_ScbIntMask 


Attributes 


-kernel 


Address 


0xE_0800_0088 
ais, Ssd|sSSCd)CSCidSCti“‘“‘iR ON OCOC™C~“‘“S*S*S*S*S*S*S*S*S*S~S~S~*S 


Reserved. (Attention transmit empty interrupts are maskable via the R_ScbAtnChip_TxInt register.) 


Reserved. (Attention Interrupts are maskable via the R_ScbAtnChip_Int register.) 


PerfInt | RW | Performance interrupt mask. Enables R_ScbInt_PerfInt asserting an interrupt. 


10.17.3 Interrupt Request Register 


Register 


R_ScbIntReq 


Attributes 


-kernel 


Address 


0xE_0800_0090 


Reserved. 


Reserved. (Attention transmit empty interrupts can be created via the R_ScbAtnChip register.) 


Reserved. (Attention Interrupts can only be requested by the MSP.) 


PerfInt | W1CS | Performance interrupt request. Asserts R_ScbInt_PerfInt. 


10.17.4 Performance Control Register 
Register 
R_ScbPerfCtl 


Attributes 


-kernel 


Address 
0xE_0801_0000 


Reserved. 


12:11 | MagicEvent | Model Magic events. For verification, allow creating of raw events trackable 
with ScbScbEvent_MAGIC0 and _MAGIC1. 


AddrAssert | Model Magic address assertion. Fire an assertion on a read or write to a bad address. 
No function in silicon; reads to bad addresses always return 0xFFFFFFFF regardless of this bit. 


NoInc | Disable automatically incrementing the bucket. When clear, after each _Interval, increment 
the R_ScbPerfBuckNum register. When set, always use the specified static R_ScbPerfBuckNum. 


8:4 | IntBit | RW | Interrupt bit select. Bit number that, when set, asserts an interrupt. Thus the 
default of 31 will interrupt before a counter may overflow, and a value of 0 will interrupt when 
any event occurs (bit 0 asserts). Interrupts occur when the count bit overflows, and don’t wait 
until the interval completes. Interrupts do not stop the counting. 

3:0 | Interval | RW | 3 | Sampling interval. Log2 number of cycles to spend on sampling each 
bucket. 0=32 cycles, 1=64 cycles, ..., 15=1M cycles. Note setting a 1M cycle interval will 
require nearly a second before the entire RAM is sampled, which will delay R_ScbPerfStat_Run 
clearing by up to a second. 
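
The Interval and IntBit encodings reduce to simple shifts. The C sketch below uses hypothetical helper names, and the sweep estimate assumes that the 256 buckets are visited as 128 pairs, which the text implies but does not state as a formula.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent on each bucket for a given Interval value: 0=32, 1=64, ..., 15=1M. */
static uint32_t scb_perf_interval_cycles(unsigned interval)
{
    assert(interval <= 15u);
    return 32u << interval;
}

/* Smallest Interval whose window is at least `cycles` long, or 15 if none is. */
static unsigned scb_perf_interval_for(uint32_t cycles)
{
    for (unsigned i = 0; i < 15u; i++)
        if (scb_perf_interval_cycles(i) >= cycles)
            return i;
    return 15u;
}

/* Count value at which the bit selected by IntBit first becomes set. */
static uint64_t scb_perf_int_threshold(unsigned int_bit)
{
    assert(int_bit <= 31u);
    return (uint64_t)1 << int_bit;
}

/* Rough cycles for one pass over the RAM; `pairs` is an assumption (128 for 256 buckets). */
static uint64_t scb_perf_sweep_cycles(unsigned interval, unsigned pairs)
{
    return (uint64_t)scb_perf_interval_cycles(interval) * pairs;
}
```

With Interval=15 and 128 pairs the estimate is about 134 million cycles, in line with the remark that a 1M cycle interval needs nearly a second per pass.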


10.17.5 Performance Histogram Register 


Register 
R_ScbPerfHist 


Attributes 


-kernel 


Address 


0xE_0801_0008 


Reserved. 


HistGte | Histogram greater-than-equal value. Running experiments counting a “waiting-for” type 
of event, and varying HistGte, will give enough data to generate a histogram of latency versus 
probability. 
For each bucket, if R_ScbPerfBuckets_Hist is cleared, this register is ignored and that bucket 
counts cycles. 
If R_ScbPerfBuckets_Hist is set, _HistGte == 0 gives unspecified results. (As it is meaningless 
to look for the times just a 0 to 0 transition occurs.) 
If R_ScbPerfBuckets_Hist is set, and _HistGte == 1, the bucket counts the number of occurrences 
of the serial regular expression 0+1+, which is simply the number of positive edges. 
If R_ScbPerfBuckets_Hist is set, and _HistGte >= 2, count one for every time the event is high 
for >= R_ScbPerfHist’s number of cycles. I.e. with _HistGte=2, count 0+11+. With _HistGte=3, 
count 0+111+, etc. 
If R_ScbPerfBuckets_Hist is set, _HistGte == all ones gives unspecified results. 
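
The histogram rule can be checked against a small software model. The following C sketch is an interpretation of the text, not the hardware's logic: it counts one for each run of ones that follows at least one zero and reaches HistGte cycles in length. Ones seen before the first zero are assumed not to match, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* event[i] is the raw event level (0 or 1) on cycle i. */
static uint32_t scb_hist_model(const uint8_t *event, size_t n, unsigned hist_gte)
{
    uint32_t count = 0;
    unsigned run = 0;      /* length of the current run of ones */
    int seen_zero = 0;     /* at least one zero has been observed */

    assert(hist_gte >= 1u);           /* HistGte == 0 is unspecified */
    for (size_t i = 0; i < n; i++) {
        if (!event[i]) {
            seen_zero = 1;
            run = 0;
            continue;
        }
        if (!seen_zero)
            continue;                 /* assumption: the pattern needs a leading 0 */
        run++;
        if (run == hist_gte)
            count++;                  /* count each qualifying run exactly once */
    }
    return count;
}
```

Sweeping hist_gte over a captured trace and differencing adjacent counts gives the latency histogram that the register description has in mind.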


10.17.6 Performance Bucket Number Register 
Register 
R_ScbPerfBuckNum 


Attributes 


-kernel 


Address 


0xE_0801_0010 
15:8 | Reserved (for increasing number of buckets). 


7:0 | Bucket number. The current bucket being sampled. This will automatically increment by 2 if 
counting is in progress and R_ScbPerfCtl_NoInc is clear. Bit 0 is ignored, as counting is always 
done in bucket pairs. 


10.17.7 Performance Enable Register 


Register 
R_ScbPerfEna 


Attributes 


-kernel 


Address 


0xE_0801_0020 


RWSL | Enable. Write one to start sampling/counting. Counting will continue as long as this 
remains set. Clear to end counting at next opportunity: when interval completes on the last 
bucket, or R_ScbPerfCtl_NoInc and any bucket. R_ScbPerfStat_Run will clear when the final 
sample is completed. 


10.17.8 Performance Status Register 
Register 
R_ScbPerfStat 


Attributes 


-kernel 


Address 
0xE_0801_0028 


Reserved. 


run R Sampling is running. True when counting is 
active. The count ram will not have the most 
recent counts until this deasserts. 


10.17.9 Performance Bucket Configuration 


The R_ScbPerfBuckets registers contain the event number and controls for when the associated bucket is 
counted. 


Register 
R_ScbPerfBuckets[255:0] 


Attributes 


-noregtestcpu_reset -kernel 


Address 
0xE_0801_4000-0xE_0801_43FC 


17:16 | ifOther | RW | FWO | Count if AND other event. 00 or 11, normal operation. When 01, only 
increment the count in those cycles where this event and the opposite bucket’s (odd bucket’s event 
for even buckets, even bucket’s event for odd buckets) raw event before applying ifOther or 
histogramming is asserted. When 10, only count when the event AND NOT the opposite event. Note 
this only works when comparing against other events in the same clock domain. (See 16.6.6 for the 
clock domain list, and note IfOther counting for two events in the same subchip ID is always ok.) 


15 | hist | RW | FWO | Histogram or count edges on the specified event. Otherwise if clear, count 
cycles where the event is asserted. 

See R_ScbPerfHist. This detection occurs after the ifOther equation. 


14:0 | event | RW | FWO | Event ID to count. Consists of the SCB slave number (see 16.6.6), 
concatenated with the 8-bit event number inside that slave. Also see the AllEvent enumeration (in 
source only, not a spec). Events not specified return zero counts. 

In TWC9A and follow-ons, if both pairs of events contain the special value AllEvent_INVALID (with 
encoding 0), this pair will not be sampled, and sampling will quickly continue to the next bucket. 

For a list of AllEvent (or Ice9_AllEvent) enumerations with descriptions, see 
“<project>/sw/include/sicortex/ice9/ice9_all_spec_sw.h”. These enumerations provide you all 15 bits 
for the “event” field of R_ScbPerfBuckets. Note that these enumerations don’t list the OCLA events 
that are available to count. See the On Chip Logic Analyzer chapter. In OCLA LAC, all the 
trigger-block triggers are available, after delays are applied. In OCLA TRBCs, the outgoing 
triggers are available (like getting them from LAC but without delays). In OCLA TRBVs, the 32 
incoming data signals are available. 
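
A bucket configuration word can be assembled from the fields above as in this C sketch. The helper names are hypothetical, and the event ID layout (7 bit slave number in the upper bits, 8 bit event number in the lower bits) is inferred from SCB_SLAVEBITS and the event field description. Real code would more likely take the 15 bit value directly from the AllEvent enumeration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    SCB_IFOTHER_NORMAL  = 0x0,  /* 00 (or 11): normal operation */
    SCB_IFOTHER_AND     = 0x1,  /* 01: this event AND the opposite bucket's raw event */
    SCB_IFOTHER_AND_NOT = 0x2   /* 10: this event AND NOT the opposite event */
} scb_if_other;

/* 15 bit event ID: slave number concatenated with the 8 bit event number. */
static uint32_t scb_event_id(unsigned slave, unsigned event)
{
    assert(slave < 128u && event < 256u);
    return ((uint32_t)slave << 8) | event;
}

/* R_ScbPerfBuckets value: ifOther in 17:16, hist in 15, event in 14:0. */
static uint32_t scb_bucket_cfg(scb_if_other if_other, int hist, uint32_t event_id)
{
    assert(event_id < (1u << 15));
    return ((uint32_t)if_other << 16) | ((hist ? 1u : 0u) << 15) | event_id;
}

/* The opposite bucket of a pair: odd for even buckets, even for odd buckets. */
static unsigned scb_bucket_partner(unsigned bucket)
{
    return bucket ^ 1u;
}
```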


10.17.10 Performance Count Ram 


The ScbPerfCounts registers contain the counts for each bucket, indexed by bucket number. 


Register 
R_ScbPerfCounts[255:0] 
-noregtestcpu_reset -kernel 


Address 
0xE_0801_8000-0xE_0801_83FC 


31:0 | count | RW | FWO | Performance counts. Number of cycles for which the given bucket’s event 
was asserted. For the read to include the most recent interval’s results, R_ScbPerfStat_run must 
be clear. 


On Chip Logic Analyzer 


[$Id: chipocla.lyx 50693 2008-02-07 16:01:46Z wsnyder $] 


11.1 Overview 


The On-Chip Logic Analyzer (OCLA) provides debug capabilities for the processor segments and their associated 
L2 caches (PSX), the fabric switch (FSW), the DMA engine, the two coherence units (COHE and COHO) and 
the PMI unit. The OCLA is distributed around the chip and includes Capture Trace Blocks (CTBs), Trigger 
Blocks (TRBs), and a central controller (LAC). The trigger blocks come in two varieties; Codeword Trigger Blocks 
(TRBCs) and Vector Trigger Blocks (TRBVs). Some CTBs and TRBs have muxed inputs to allow larger numbers 
of signals to be sampled or triggered upon on a mutually exclusive basis. The CTBs, TRBs, and LAC are accessed 
via the Serial Configuration Bus (SCB). The module service processor may access the OCLA via the SCB hook on 
the SysChain. 


11.2 Differences, Bugs, and Enhancements 


11.2.1 Product and Chip Pass Differences 


1. ICE9B fixes GO->0 should shut OFF collection, bug2246. CollectTrace can be left ON by stopping an OCLA 
program that had not yet seen its trigger. CollectTrace can only be controlled by a running OCLA program, 
so you can’t shut it off by SCB writes. While CollectTrace is ON, you cannot dump any CTBs. Workarounds: 
(a) A Diagnostics Dash script has been written that loads and runs a minimal OCLA program to shut off 
CollectTrace. (b) The OCLA dump program has been written to detect CollectTrace=ON, and exit with a 
meaningful error message. (c) OCLA Dash scripts and all example OCLA programs have been written with 
a “graceful exit” option, where a specific register-write tells it to shut CollectTrace OFF and stop watching 
for the trigger it didn’t get yet. 


2. ICE9B adds new INCRBTH Opcode, bug2179. In ICE9A, although OCLA has 2 counters, you cannot count 
2 events concurrently, because if both happen on the same clock there’s no way to increment both counters. 


3. ICE9B enlarges counters from 12 to 16 bits, bug2244. 


4. ICE9B fixes PMI CTB ExtMuxSel wired to TRBC, bug1959. The ExtMuxSel wires of OCLA PMI CTB were 
wired to the SCB register that’s supposed to control OCLA PMI TRBC. To work around, write the desired PMI 
CTB ExtMuxSel value to the ExtMuxSel field in the control register for PMI TRBC. Fortunately, PMI TRBC has 
no ExtMux, so this field is otherwise unused. The simplest solution, without determining whether you have ICE9A 
or ICE9B, is to write the desired PMI CTB ExtMuxSel value to both ExtMuxSel fields. 


5. ICE9B fixes CAC trigger PrbState obscured by WtPrb2L2, bug1995. OCLA CAC TRBC mux=2 signals 
PrbState[2:0] had WtPrb2L2 OR-ed into PrbState[2]. To work around, don’t use PrbState as a trigger, or 
only trigger on PrbState groups of state that you can identify with bits [1:0]. 


6. ICE9B fixes CAC trigger W0Hit/W1Hit instead of W0Miss/W1Miss, bug2243. In ICE9A, both CAC 
Trigger Block and Collector Block hookups: (a) Change W0Miss/W1Miss to something better, perhaps 
W0Hit/W1Hit. Miss is including Idle and I/O. (b) Adjust flops so W0Hit/W1Hit are in the same clock with related 
signals. To work around, (a) qualify with not-Idle and not-IO. (b) Separately feed Hit and the other signals 
to LAC in separate triggers, then align them with Dly regs in LAC. 


7. MIGHTFIX: TWC9A might fix OCLA to SCB uses LAC triggers, bug1717. Passing OCLA events from 
trigger blocks to SCB Counters ties up LAC trigger configuration, usually preventing simultaneous OCLA 
use for other purposes. To work around, accept that you are tying up OCLA with this. The cross connections 
between OCLA and SCB counting may not be used that much. You might prefer to count SCB events in 
SCB counters, and count OCLA events in OCLA counters. 


8. MIGHTFIX: TWC9A might allow trigger delays for blocks located in other than the CCLK domain, bug1854. 


9. MIGHTFIX: TWC9A might add capture mux settings for the CPU program counter and L2<->L1 signals. 


10. NEED IMPL: TWC9A might add capture mux settings for the FSW links 1 and 2, bug2232. 


11. MIGHTFIX: TWC9A might fix DMA CTB qualifier in wrong clock, bug2193. In DMA’s hookups to OCLA, 
the ue_xxx_DbgValid_c2a signal is sent into the trigger block and CTB, when really it should be delayed by 
two more cycles. In the CTB as a qualifier we pretty much cannot use it, because you want to use it in 
combination with other signals like DbgThread_c4a and DbgPc_c4a. To work around, only do un-qualified 
collection in DMA CTB. In DMA trigger block, send it and other signals separately on the 2 triggers to LAC, 
where the Dly regs can align them. 


12. MIGHTFIX: TWC9A might add a WtAddr sticky overflow bit, bug2207. 


13. MIGHTFIX: TWC9A 


11.2.2 Known Bugs 


1. Overflow bits still set as OCLA starts, bug1825. OCLA’s automatic clearing of counter overflow bits when 
you start a LAC program is delayed a clock or two. Early instructions in a LAC program can falsely trigger on 
overflow depending on the previous use of OCLA. To work around, never branch on Counter Overflows in the first 
2 instructions of any LAC program. 


2. CTB WtAddrClr triggered by any address in CTB, bug2026. Writing 0x10 to any SCB register address in 
a particular OCLA CTB can trigger WtAddrClr (clear write address reg). This even includes unused addresses 
within the SCB address space of a CTB. To work around, never write any of the read-only registers. 


11.2.3 Possible Enhancements 


1. Make both LAC counters 32-bit (currently 16-bits plus sticky overflow bit). There’s only one instance of the 
LAC, so this is very affordable. We’ve wanted bigger counters when writing LAC programs, and unanticipated 
but valuable use of OCLA as a highly-configurable counter would benefit from full 32-bit counters. 


2. Separate “GO” Register. When you write OCLA management software for one of Ice9’s embedded processors, 
or for the external SSP, you tend to write one function that configures OCLA ahead of time, and another 
function to tell OCLA to “GO” at roughly the right moment. Currently the GO bit shares register R_LacCtl 
with some configuration fields that need to be written correctly for what you want OCLA to do. This 
contributes to messy software design in that you must have handy the values to write to those fields when you 
write a 1 to GO to start the LAC program. It would be nice if all OCLA configuration could be encapsulated 
in, and completed by, an OCLA configuration function. 


3. If SCB reg addresses are cheap, consider breaking R_LacCtl into 3 or 4 registers by type of access, making 
software easier to write. 


4. Collect ON/OFF by Register Write. Provide a super-simple alternative to writing a LAC program, for when 
exact timing of collection is not critical. Provide one or two registers that allow you, by SCB register write 
alone, to turn on and off CollectTrace to the CTBs. This allows someone with minimal knowledge of OCLA 
to quickly collect some trace information and read it out, just by doing easy-to-understand SCB writes and 
reads. Some semi-steady-state activities can be viewed at an arbitrary time, or you could try more than 
once till you see it. Or, for more accuracy, you could have Ice9 embedded processor code trigger collection at 
roughly the right time, and rely on the 1024-entry size of the CTBs to give you a pretty big window to land 
in. These reg writes would use the same logic as the SETCOLL and CLRCOLL opcodes from LAC. 


5. Trigger by Register Write. There are ways to do this now, but they’re a little obscure. I’m suggesting a 
very simple up-front way to trigger your LAC program by writing an SCB register in LAC whose sole purpose 
is to do this. Aggregate Mask and Match bits 0 and 1 are available, so why not have them driven directly 
from such a register. 


6. Clarify When CTB Has New Contents. Currently it’s a little hassle to do sanity checks that your CTB really 
got new contents from running your LAC program. Especially when you are wondering if you configured 
everything correctly. You can “trust that a good-status completed LAC program means you have new CTB 
contents”. You can alternate the CTB’s external mux between what you want to collect and something else, 
then read-out the CTB and see that contents changed. 


7. CTB Zeroing. An SCB-register “ClearCtb” action-bit in each CTB, that would zero-out the CTB (taking 
1024 clocks). This bit could be readable and self-clears after the 1024 clocks have passed, so it’s safe to start 
a new collection. 


8. StopOnFull Final Address. Currently, in StopOnFull mode, when the CTB gets full and stops collecting, the 
final address is 0x000, which is the same address it would have if it never started! Either change this to stop 
at 0x3FF, or have a sticky overflow bit which clears when you write WtAddrClr in R_CtbxColCtl. 


9. StoppedOnFull Status Bit. If in StopOnFull mode, have a read-only bit StoppedOnFull in R_CtbxColCtl. 
This signal already exists in the CTB Verilog code. 


10. Fix the “Collecting” Status Bit. Bit “Collecting” of R_CtbxColCtl is directly flopped off of lac_ctb_CollectTrace_c0a, 
which means it doesn’t take into consideration a CTB in StopOnFull mode that has become full. Reading of 
the CTB works in that case. Change Collecting to be false if StopOnFull and full. A signal with this info 
is available in the CTB Verilog code. You might also consider having “Collecting” read back as 0 when 
EnableCollect==0. To be able to see the level of signal lac_ctb_CollectTrace_c0a clearly in one central place, add 
read-only bit “CollectTrace” to R_LacCtl (or if R_LacCtl gets broken up into several registers as suggested, 
put this bit in whatever register contains the other read-only fields). 


11. Have 0xFFFFFFFF Indicate Bad Read. If you try to read the contents of your CTB when you cannot, 
you currently get all-zeros. All-zeros can mean you never collected anything, and also for some units it’s 
a likely read-result if you collected during an idle time. A tiny change in the Verilog could make it return 
0xFFFFFFFF for reads when you can’t read the CTB. This would be clearly different than a failure to 
trigger collection, and is an almost-impossible long series of values for any CTB to collect. 


12. Stopping LAC Stops Collection. Have a transition of the GO bit 1 -> 0 cause the CLRCOLL action. This 
eliminates the hazard of someone stopping the LAC program manually by clearing the GO bit, but then being 
unable to read any CTB contents because CollectTrace is still ON. Have this be by 1 -> 0 transition, not by 
GO==0, so we can have the previously-mentioned registers that turn on and off collection. The way OCLA 
is now, it can be very irritating if you happened to shut off LAC by writing 0 to the GO bit when collection 
was ON. There’s no straightforward way to shut off collection of all enabled CTBs by register-write; you can 
only shut them off by opcode CLRCOLL in a running LAC program. This is no problem when the next 
LAC program you wish to run is of the CTB StopOnFull=0 unqualified style, but if you are doing qualified 
collection with StopOnFull=1 and you want to start at CTB address=0 it can be a problem. You might think 
you could just begin every LAC program with a CLRCOLL and your problems would be solved, but there’s 
no way inside a LAC program to clear a CTB’s WtAddr. 


13. Move Delay Registers into the Trigger Blocks. Having the Delay Registers centralized in LAC means they’re 
all flopped in cclk domain. FSW triggers and trigger blocks are in sclk domain. To be able to line up FSW 
signals into a complex trigger is hard, although this was partly solved by providing some FSW trigger signals 
to its trigger blocks more than once, with different sclk delays. The best solution to this is to have the delay 
registers in the Trigger Blocks, not centralized in the LAC. 


14. More External-Mux Values, or Extra Mux in FSW. (a) Boost the number of bits to control external muxes from 
3 to 4 or 5. Do this for all types of trigger and collector blocks. Almost no extra logic is created by this except 
in those blocks where the extra external-mux-value options are used. The motivation for this is with regard 
to the Link side of FSW. Currently OCLA in FSW only looks at FLR-0 and FLT-0 signals, due to mux-value 
limitations. For better board and system debug, to use OCLA freely to see damaged traffic arriving on any one 
particular link, we really want all 6 links covered by OCLA. (b) Another way to get all 6 Link interfaces in 
FSW into OCLA, without changing OCLA Trigger or Collector blocks, is to just put a new register into FSW. 
This register in FSW’s register address space would take values of 0, 1, or 2, and would drive a first level 
of muxing, selecting which link-number provides FLR and FLT signals to the current OCLA-register-driven 
external muxes. 


15. More Collection Qualifiers. CTBs currently allow up to 2 Qualifier signals. In some uses of CTBs there were 
more signals that would be handy to have available as qualifiers. The external mux selecting data for a CTB 
often selects between a good number of unrelated interfaces. In a number of cases you just accept that you 
have to do un-qualified collection, because the 2 qualifiers provided are not relevant to the interface or signals 
you are looking at. 


16. More CTB Qualifier Inputs. Perhaps 4. 


17. Use External Mux on Qualifiers. When instantiating CTBs, follow the example of how FSW Vector Trigger 
Blocks are instantiated, where the external mux selectors vary both the data and the qualifier to be used. 


18. Eliminate Qualifiers in Codeword Trigger Blocks. The way Codeword Trigger Blocks work, all the trigger 
inputs are effectively qualifiers on each other. There’s no reason to handle some inputs differently and call 
them “qualifiers”. 


19. Widen Vector Trigger Blocks to 64-Bits. FSW is really the only place where Vector Trigger Blocks are used, 
because the way they’re used in DMA is more naturally served by Codeword Trigger Blocks. In FSW the 
natural width of the busses looked at is 64 bits. It would be a usage simplification if the Trigger Block just 
looked at the 64 bits. 

11.3 Description 


In the ICE9 implementation, the OCLA units spread over the chip are: 


1 LAC central controller. 

6 CTBs: one for each of the six processor/L2 cache segments (PSXs). 

2 CTBs: one for each of the two coherence engines (COHE and COHO). 

2 CTBs: in the FSW unit. 

1 CTB: in the PMI. 

1 CTB: in the DMA Engine. 

12 TRBCs: one for each of the six PSX segments, plus two for the PMI, plus one each for the COHE, 
COHO, DMA, and FSW. 

3 TRBVs: two in the FSW, and one in the DMA. 


All CTBs are 1024 entries deep by 33 bits wide. 
The number of different sets of signals you can choose to collect is quite large, selected by External Mux settings 
in each CTB: 


PSx CTBs: 3 mux settings * 6 PS’s = 18 sets of signals. 

COHx CTBs: 4 mux settings * 2 COH’s = 8 sets of signals (plus free-running counter). 
FSW Input CTB: 5 sets of signals. 

FSW Output CTB: 5 sets of signals. 


DMA CTB: 4 sets of signals. 

PMI CTB: 7 sets of signals. 


For a total of 47 sets of signals. 


Figure 11.1: The On Chip Logic Analyzer Control Unit (LAC) 
(Diagram labels: TMatch<31:0>, Delay[31:0], Aggregate Match[4:0], OverFlow0, OverFlow1, Addr<11:0>, FSM RAM, 
Load, Start, Stop, Data<9:0>, Opcode<4:0>, State<4:0>, Opcode Decoder, CSRs and SCBS, ocla_slow_int, dbg_int, 
sys_ocla_trig_p, CollectTrace.) 


More than one CTB can be enabled for collection at once, although this only makes sense if you can arrange to 
have a window of time during which both CTBs are collecting meaningful events. 


11.4 Package Attributes 


Package 
chip_ocla_spec 


11.5 LAC Signals and Innards 
11.5.1 What LAC Does 


The main purpose of LAC (and your LAC program, and the values you write into LAC registers) is to control 
the CollectTrace signal (lac_ctb_CollectTrace_c2a) leading to all the Collector Blocks. When CollectTrace is ON, all 
CTBs (collector blocks) will collect values in the manner in which each has been configured. When CollectTrace is 
OFF, all CTBs are not collecting. 

Secondary purposes of LAC are to set the Debug Interrupt, set the Slow Interrupt, set the 2 readable Flags, 
and to provide final status information to the user by ending at different addresses which can be read from the 
R_LacCtl register. 


11.5.2 LAC Innards 


The LAC provides the coordination and recognition of the actual trigger event. In most cases, logic analyzer 
triggers are more complicated than “fire when you see address X on bus Y.” Instead, they frequently take the 
form of “fire on event A followed by event B followed by event C, but reset the recognizer on event D.” This event 
recognizer is a state machine. I have no idea what sequences will be useful at this time, and I doubt any a priori 
guess is worthwhile. That being the case, the LAC is implemented as a field programmable state machine. (Don’t 
worry, this isn’t as complicated as it sounds.) The state machine may have up to 32 states. 

The LAC has 32 trigger event inputs. Each input is synchronized and passes through a programmable delay 
chain that imposes between 2 and 7 cclk cycles of delay. The 32 bit vector that pops out of the array of delay 
chains is compared (using value/mask pairs similar to those in a vector comparison TRB; see section 11.8) in five 
aggregate event comparators. The result is a five bit wide “trigger event vector.” 

The LAC also contains two 16 bit counters (12 bits in Ice9A). Each counter is loaded with an initial value that 
is scanned in when the LAC is started. The initial value is written to counter X when the state machine selects the 
LOADx opcode or whenever the ’Go’ bit in the LAC Control Register is set to one. When the counter overflows, 
it sets the CTRxOFLO bit. This bit is sticky; it stays set until either the recognizer asserts STARTx or LOADx 
again or the ’Go’ bit in the LAC Control Register is set to one which forces a LOAD to both counters. Figure 
11.1 shows the outline of the control unit. The FSM RAM holds 4K ten bit instructions. An instruction consists 
of both an opcode and a next state. The LAC is configurable to implement any state machine possible with seven 
inputs, five outputs and thirty two states. 

Aggregate Match inputs are used to consolidate multiple trigger inputs to the LAC into a single pattern to be 
matched. See 11.5.3.3 on page 542. AgMatch[x] is true if (TMatch[31:0] & AMask[31:0]) == AMatch[31:0], using 
mask and match register x. 
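
The mask-and-match rule can be tried out in software before committing values to the registers. The following is a minimal sketch for reasoning about settings, not the hardware implementation; the function name aggregate_match and its argument names are made up for illustration, and the invert argument models the InvAgMat field of R_LacCtl.

```python
def aggregate_match(tmatch: int, amask: int, amatch: int, invert: bool = False) -> bool:
    """Model one aggregate comparator: (TMatch & AMask) == AMatch.

    tmatch: the 32 synchronized, delayed trigger inputs as one integer.
    amask, amatch: contents of one mask register and its match register.
    invert: models InvAgMat, which turns the == test into !=.
    """
    hit = (tmatch & amask & 0xFFFFFFFF) == (amatch & 0xFFFFFFFF)
    return (not hit) if invert else hit


# Require trigger bits 31 and 30 to both be set; ignore all other triggers.
mask = 0xC0000000
match = 0xC0000000
print(aggregate_match(0xC0000001, mask, match))   # both bits set
print(aggregate_match(0x40000000, mask, match))   # only bit 30 set
# A zero mask with a nonzero match value can never match.
print(aggregate_match(0xFFFFFFFF, 0x00000000, 0xFFFFFFFC))
```

Note that any match bit set outside the mask makes the comparison fail on every clock, which is how an unconfigured comparator is kept quiet.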


11.5.2.1 LAC to SCB-Performance-Counters 


All triggers coming into LAC from Trigger Blocks are also provided to the SCB Performance Counters mechanism 
as events to be counted. 

To select a trigger from LAC to count in SCB Performance Counter, program the event field in R_ScbPerfBuckets, 
as described in the Serial Configuration Bus chapter. In “event” put the SubChipId for LAC (from the Addressing 
chapter) and 8 bits saying which one of the 32 triggers you want (bits 7,6,5 are zero). 

These triggers are provided to SCB Performance Counters after being delayed by LAC’s delay registers, but 
before being combined into t0 - t4. These delays allow SCB Performance Counters to condition one event on another 
event with a corrective skew between the two, in case the signals are related but occur one or more clocks apart. 
The conditioning is done by logic within the SCB Performance Counters mechanism. See the Serial Configuration 
Bus chapter for how to do this. 

To provide these events, LAC hardware uses the performance counter feature of its embedded SCB slave. The 
slave provides two input signals (x_scbs_event_x[1:0]), and a mux select for each (scbs_x_eventId{0|1}_xr). The 
LAC uses each mux select to choose one from among the 32 synchronized and delayed trigger inputs as specified 
above. 

How much does this limit simultaneous normal use of OCLA? A little bit. One or two trigger blocks (and their 
delays) would be configured in the manner needed for SCB Performance Counters. An OCLA program could ignore 
them, using triggers from other blocks, but if it wants triggers from those blocks, it must use them with the same 
configuration and delays needed by SCB Performance Counters. Each Trigger Block has 2 trigger outputs, so if 
SCB Performance Counters only needs one of them, the other could be configured as needed for OCLA, although 
the external mux setting would have to be the same for both uses. 

When counting events from trigger blocks in a different clock domain than LAC, like from FSW, it’s better, 
when possible, to get the events directly from those trigger blocks. SCB Performance Counters has a way of getting 
correct counts from SCB slaves in different clock domains, whereas the clock-domain crossing from trigger blocks 
to LAC is not so nice. The PulseStretch mechanism for making sure LAC sees a trigger from a faster-clock trigger 
block is fine for triggers, but poor for counting. If you must get your counts from LAC, consider using OCLA 
PulseStretch along with the “transitions counting” option in SCB Performance Counters. 


11.5.2.2 SCB-Performance-Counters to LAC 


The 2 events configured to be counted by the SCB Performance Counters mechanism are also provided to OCLA 
for triggering. See the ScbTrig0En and ScbTrig1En fields in R_LacCtl. These are OR-ed into triggers t0 - t4, after 
trigger-block triggers are masked and matched, but before possible inversion by R_LacCtl field InvAgMat. 

If you do this you’ll have to manage your SCB Bus use. As explained in the Serial Configuration Bus chapter, 
anytime you are doing SCB writes or reads the detections of Performance Counter events are temporarily shut off. 
Even something as innocent as polling R_LacCtl.Flag to see whether OCLA got the trigger and collected will create 
blackout periods that could hide the very trigger you are waiting for! 

How much does this limit simultaneous normal use of SCB Performance Counters? A lot. If you configure for 2 
events, SCB Performance Counters would be limited to counting these events only. If you configure for one event 
from SCB Performance Counters to affect LAC programs, you still have some flexibility for unrelated use of SCB 
Performance Counters. 


11.5.2.3 LAC Operation Codes 
Enum 


LacOp 


Encoding | Mnemonic | Description 
5’h0 | NOOP | Do Nothing in Particular 
5’h1 | SETEXTP | Set External OCLA trigger output pin 
5’h2 | CLREXTP | Clear External OCLA trigger output pin 
5’h3 | SETCOLL | Set CollectTrace output 
5’h4 | CLRCOLL | Clear CollectTrace output 
5’h5 | SETFL0 | Set Flag 0 in CSR 
5’h6 | CLRFL0 | Clear Flag 0 in CSR 
5’h7 | SETFL1 | Set Flag 1 in CSR 
5’h8 | CLRFL1 | Clear Flag 1 in CSR 
5’h9 | SETDBI | Set Debug Interrupt output 
... | ... | ... 
5’h17 | | Increment Counter 1 
5’h18 | INCRBTH | Increment Both Counters 


11.5.2.4 Be Sure To Shut Off CollectTrace 


If your LAC program will be or might be used on Ice9A chips, it needs to shut off CollectTrace before the 
program finishes or is stopped by register-write. Otherwise it may (a) cause you to get all-zeros when you read the 
contents of a collector block, or (b) cause premature data collection during the next use 
of OCLA. This is fixed in Ice9B and later, but in Ice9A, stopping the LAC program does not shut off CollectTrace. 
In Ice9A it can only be shut off by a LAC program instruction. See the “CTB Innards” section below for more 
details. 




11.5.3 LAC Registers 
11.5.3.1 The Control Register 
Register 


R_LacCtl 


Attributes 


-writeonemixed 


Address 


0xE_6800_0000 


Bits | Field | Access | Reset | Description 
31:27 | ScbTrig1En | RW | 0 | OR scb_ocla_event_cr[1] into AgMatch[x] 
26:22 | ScbTrig0En | RW | 0 | OR scb_ocla_event_cr[0] into AgMatch[x] 
21:17 | InvAgMat | RW | | Invert sense of AgMatch. When [x] is True, AgMatch[x] = ((TrigIn[31:0] & AMask[31:0]) != AMatch[31:0]) 
 | DbgInt | RW1C | 0 | Debug Interrupt to MIPS Cores 
 | | RW1C | 0 | Slow Interrupt output 
 | FSMAddr | R | 0 | Current state of Address input to FSM RAM 
 | Flag | R | 0 | Readable flags from the FSM Outputs 
 | Go | RW | | When TRUE, FSM is sequencing. When Go is 0, the STATE is set to 0 and the opcode is 0 (NOOP). When Go transitions to 1, the initial STATE is 0. 


Be careful when writing GO=1 to start the LAC program: that same register-write must contain your desired 
configuration values for ScbTrig1En, ScbTrig0En, and InvAgMat. 


11.5.3.2 The Delay Registers 


Each input trigger passes through two levels of CCLK flops (as a synchronizer). Each trigger then can be 
delayed by from 0 to 5 additional CCLK cycles before passing on to the AggregateMatch comparators. See “Uses 
for the Delay Registers” subsection of “Hints for Using Trigger Blocks” section later in this chapter. If you put a 6 
or 7 in, you get a delay of only 5. 
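
The delay arithmetic above is easy to get off by two, so here is a small sketch of it. It assumes the delay value is a 3-bit field (0 to 7), which is inferred from the remark about 6 and 7 rather than stated; the function and constant names are made up for illustration.

```python
SYNC_FLOPS = 2       # two levels of CCLK synchronizer flops on every trigger
MAX_EXTRA_DELAY = 5  # programmed values 6 and 7 behave like 5


def total_trigger_delay(reg_value: int) -> int:
    """Total CCLK cycles from trigger input to the AggregateMatch comparators."""
    if not 0 <= reg_value <= 7:
        raise ValueError("delay value assumed to be a 3-bit field (0..7)")
    return SYNC_FLOPS + min(reg_value, MAX_EXTRA_DELAY)


for value in range(8):
    print(value, total_trigger_delay(value))
```

This reproduces the 2 to 7 cclk range quoted in the LAC Innards description.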


Register 


R_LacTrgDly[31:0] 


Address 


0xE_6800_0100-0xE_6800_017f 




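11.5.3.3 The Aggregate Mask Registers 
Register 


R_LacAggMsk[4:0] 


Address 
0xE_6800_0600-0xE_6800_0613 


Bits | Field | Access | Reset | Description 
31:0 | Mask | RW | 0 | Full mask register (Overlaps allowed) 
31:30 | TrbcPs5 | RW | 0 | Processor Segment 5 Codeword Triggers (Overlaps allowed) 
29:28 | TrbcPs4 | RW | 0 | Processor Segment 4 Codeword Triggers (Overlaps allowed) 
27:26 | TrbcPs3 | RW | 0 | Processor Segment 3 Codeword Triggers (Overlaps allowed) 
25:24 | TrbcPs2 | RW | 0 | Processor Segment 2 Codeword Triggers (Overlaps allowed) 
23:22 | TrbcPs1 | RW | 0 | Processor Segment 1 Codeword Triggers (Overlaps allowed) 
21:20 | TrbcPs0 | RW | 0 | Processor Segment 0 Codeword Triggers (Overlaps allowed) 
19:18 | TrbcCohe | RW | 0 | Even Coherence Unit Codeword Triggers (Overlaps allowed) 
17:16 | TrbcCoho | RW | 0 | Odd Coherence Unit Codeword Triggers (Overlaps allowed) 
15:14 | TrbvFswo | RW | 0 | Fabric Switch Output Vector Triggers (Overlaps allowed) 
13:12 | TrbvFswi | RW | 0 | Fabric Switch Input Vector Triggers (Overlaps allowed) 
11:10 | | RW | 0 | Fabric Switch Control/Status Codeword Triggers (Overlaps allowed) 
9:8 | TrbvDma | RW | 0 | DMA MicroEngine Vector Triggers (Overlaps allowed) 
7:6 | TrbcDma | RW | 0 | DMA CSW Bus Stop Codeword Triggers (Overlaps allowed) 
5:4 | TrbcPmii | RW | 0 | PMI Internal Signal Codeword Triggers (Overlaps allowed) 
3:2 | TrbcPmi | RW | 0 | PMI CSW Bus Stop Codeword Triggers (Overlaps allowed) 
1:0 | | RW | 0 | Reserved (Overlaps allowed) 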


11.5.3.4 The Aggregate Match Registers 




Description 


Match against incoming masked delayed triggers. Aggregate match X occurs with (DelayedTrigger[31:0] & 
Mask[X]) == Match[X]. Defaults to nonzero value so that the match always fails until configured. 


Register 
R_LacAggMat[4:0] 


Address 
0xE_6800_0640-0xE_6800_0653 


Bits | Field | Access | Reset | Description 
31:0 | Match | RW | | Full match register (Overlaps allowed) 
31:30 | TrbcPs5 | RW | 0x3 | Processor Segment 5 Codeword Triggers (Overlaps allowed) 
29:28 | TrbcPs4 | RW | 0x3 | Processor Segment 4 Codeword Triggers (Overlaps allowed) 
27:26 | TrbcPs3 | RW | 0x3 | Processor Segment 3 Codeword Triggers (Overlaps allowed) 
25:24 | TrbcPs2 | RW | 0x3 | Processor Segment 2 Codeword Triggers (Overlaps allowed) 
23:22 | TrbcPs1 | RW | 0x3 | Processor Segment 1 Codeword Triggers (Overlaps allowed) 
21:20 | TrbcPs0 | RW | 0x3 | Processor Segment 0 Codeword Triggers (Overlaps allowed) 
19:18 | TrbcCohe | RW | 0x3 | Even Coherence Unit Codeword Triggers (Overlaps allowed) 
17:16 | TrbcCoho | RW | 0x3 | Odd Coherence Unit Codeword Triggers (Overlaps allowed) 
15:14 | TrbvFswo | RW | 0x3 | Fabric Switch Output Vector Triggers (Overlaps allowed) 
13:12 | TrbvFswi | RW | 0x3 | Fabric Switch Input Vector Triggers (Overlaps allowed) 
11:10 | | RW | 0x3 | Fabric Switch Control/Status Codeword Triggers (Overlaps allowed) 
9:8 | TrbvDma | RW | 0x3 | DMA MicroEngine Vector Triggers (Overlaps allowed) 
7:6 | TrbcDma | RW | 0x3 | DMA CSW Bus Stop Codeword Triggers (Overlaps allowed) 
5:4 | TrbcPmii | RW | 0x3 | PMI Internal Signal Codeword Triggers (Overlaps allowed) 
3:2 | TrbcPmi | RW | 0x3 | PMI CSW Bus Stop Codeword Triggers (Overlaps allowed) 
1:0 | | RW | 0 | Reserved (Overlaps allowed) 




11.5.3.5 The Initial Counter Value Registers 
Register 
R_LacIniCtr[1:0] 


Address 
0xE_6800_0700-0xE_6800_0707 


Bits | Field | Access | Version | Description 
15:0 | InitValB | RW | ICE9B+ | Value to be loaded into counter [x] when Reload[X] is true in ICE9B or later. (Overlaps allowed) 
11:0 | InitVal | RW | ICE9A | Value to be loaded into counter [x] when Reload[X] is true in ICE9A or ICE9A1. (Overlaps allowed) 


Note: In ICE9A and ICE9A1, bits 15:12 don’t exist, will ignore writes, and read back 0. 


11.5.3.6 The Current Counter Value Registers 
Register 
R_LacCtr[1:0] 


Address 
0xE_6800_0710-0xE_6800_0717 


Bits | Field | Version | Description 
31 | OverflowB | ICE9B+ | The “current” state of the counter’s overflow bit in ICE9B or later. Sets when bits 15:0 roll over. Won’t clear if they roll over again. 
30:16 | | | Reserved 
 | | | The “current” state of the counter in ICE9B or later. 
 | Overflow | ICE9A | The “current” state of the counter’s overflow bit in ICE9A or ICE9A1. Sets when bits 11:0 roll over. Won’t clear if they roll over again. (Overlaps allowed) 
 | Count | ICE9A | The “current” state of the counter in ICE9A or ICE9A1. (Overlaps allowed) 


Note: The actual sizes of the counters are those of the above fields for the stated versions of ICE9. 


11.5.3.7 The FSM RAM 
Class 
LacRamAddr 


Bit | Field | Description 
11 | OverFlow1 | Counter 1 Overflow 
10 | OverFlow0 | Counter 0 Overflow 
9:5 | | FSM Next State 
4:0 | AgMatch | Aggregate Match 


Register 
R_LacRam[4095:0] 


Address 
0xE_6800_4000-0xE_6800_7fff 


Bits | Field | Access | Reset | Description 
9:5 | State | W | 0 | Next state in the FSM 
4:0 | Opcode | W | | LAC Opcode 
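
When writing a LAC program by hand it helps to compute which R_LacRam entry a given combination of state and inputs selects. The sketch below assumes the address layout is OverFlow1 in bit 11, OverFlow0 in bit 10, state in bits 9:5 and AgMatch in bits 4:0, and that each entry occupies one 32-bit word at the base address given above; the overflow bit positions and the word stride are assumptions, and all function names are hypothetical.

```python
LAC_RAM_BASE = 0xE_6800_4000  # start of the R_LacRam address range


def lac_ram_index(ovf1: bool, ovf0: bool, state: int, agmatch: int) -> int:
    """Form the 12-bit FSM RAM index from the two overflows, the state, and AgMatch."""
    if not 0 <= state < 32 or not 0 <= agmatch < 32:
        raise ValueError("state and agmatch are 5-bit values")
    return (int(ovf1) << 11) | (int(ovf0) << 10) | (state << 5) | agmatch


def lac_instruction(next_state: int, opcode: int) -> int:
    """Pack a ten-bit instruction: next state in bits 9:5, opcode in bits 4:0."""
    if not 0 <= next_state < 32 or not 0 <= opcode < 32:
        raise ValueError("next_state and opcode are 5-bit values")
    return (next_state << 5) | opcode


def lac_ram_address(index: int) -> int:
    """Byte address of R_LacRam[index], assuming a 4-byte stride (an assumption)."""
    if not 0 <= index < 4096:
        raise ValueError("index must be 0..4095")
    return LAC_RAM_BASE + 4 * index


# In state 3, with only aggregate match 2 true and no overflows:
idx = lac_ram_index(False, False, state=3, agmatch=0b00100)
print(idx, hex(lac_ram_address(idx)))
```

Because every combination of the seven inputs is a separate entry, a state that should ignore an input needs the same instruction written to every entry that differs only in that input.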


11.5.4 LAC Signals 


The LAC contains its own SCB slave unit. It runs in the CCLK domain. Table 11.2 shows the various LAC 
input and output signals. 


Signal | Clock | Dir | Description 
 | | | Active-low reset, which deasserts synchronous with cclk. 
(16x) trbN_lac_Trig_x2a[1:0] | various | In | Trigger block asserts this signal when the trigger condition is met. This must be synchronized to CCLK domain by the LAC. The synchronized and delayed version of these signals are also connected to the event wires of the local SCB slave 
lac_xxx_SlowInt_c2a | | | Connected to the slow interrupt 
lac_xxx_DbgInt_c2a | | | Connected to the MIPS debug interrupt 
lac_xxx_ExtTrig_c2a | | | External trigger pin (sys_ocla_trig) 
scb_ocla_event_cr[1:0] | cclk | | 
 | | | SCB ID (tied to AddrSubId::OCLA in BBS) 
lac_ctb_CollectTrace_c2a | cclk | | The LAC produces a single active-high signal telling all capture blocks to record data to their ring buffers 
chaini_ctb_dat_r[2:0] | cclk | In | Serial chain SCB input 
ctb_chaino_dat_r[2:0] | cclk | Out | Serial chain SCB output 


Table 11.2: LAC Signals 

11.6 Collector Blocks (CTBs) in general 


This section describes what’s common to all Collector Blocks. The signals collected by each individual Collector 
Block are described in later sections. 

Each CTB is a trace buffer that is as large as 32 bits wide and 1K entries deep. The actual size is configured 
based on the space available near the block. (Only the array size changes; all control registers are wide enough to 
accommodate a 32x1K trace memory.) The trace buffer data inputs are connected to the data stream that we want 
to observe. The trace buffer write port runs off the same clock that sources the observed data stream. Figure 11.2 
shows the outline of a CTB. Its primary inputs are the SampleDataIn[31:0] signal and the CollectTrace input that 
indicates the trace buffer should collect data. When the central controller (LAC) detects that the trigger event has 
been satisfied, it will assert or deassert CollectTrace at the appropriate time to all the CTBs on the chip. At the 
deassertion edge of CollectTrace, the WT Addr in each CTB will be frozen. The CollectTrace signal from the LAC 
is timed to the L2 cache clock, cclk. CTBs connected to other clock domains are responsible for synchronizing 
this input to their own domain. 

Capture blocks (CTB) are instantiated in or near the unit whose data they will sample, and they are clocked 
by the same clock as the data to be sampled. In the description below, I will use “xclk” to represent the local clock 
domain. 


11.6.1 CTB Innards 


Each CTB contains its own SCB slave, since this keeps things reasonably simple, and the size of the SCB slave 
is small compared to even the minimal CTB configuration. 


Figure 11.2: On Chip Logic Analyzer Capture Block (CTB) 


11.6.1.1 The Control Unit and Muxes 


The Control Unit contains the trace collection control register and is responsible for sequencing writes and reads 
from the trace RAM. It also recognizes dead collection cycles and manages the dead cycle counter. 

The Output Mux selects between the low 32 bits of the trace RAM, the top bit plus the low 31 bits of the trace 
RAM, the WT Address register, or the contents of the collection control register. The choice is determined by the 
SCB register address. 


11.6.1.2 The WT Addr Register 


The WT Addr register can be cleared by the Control unit (see 11.6.2.1) and increments each time we write a 
sample or dead cycle count to the trace RAM. 


11.6.1.3 The Dead Cycle Counter 


Not all samples are worth collecting. All collector blocks except the one in PMI have a “qualified collection 
mode” (see 11.6.2.1). 

When qualified collection is enabled, the trace will include counts of cycles in various locations instead of 
collector signals data. Trace entries that are cycle counts are marked by setting bit 32 in the trace RAM to 1. 
We can read bit 32 by reading the “topbits” register range. When the qualifying condition is not met, we are not 
collecting trace samples, instead we increment the DeadCycle counter on each such cycle and store it in the collector 
block memory without advancing the write address. Once a “qualified” clock occurs, write address is advanced and 
the normal collection data is stored. The dead cycle counter is cleared each time a new qualified sample is recorded 
into the trace RAM. This compacts or collapses what’s stored in a collector block, allowing events over much more 
than the usual 1024 clocks to be observed. 

The dead cycle counter is only 16 bits. Whenever it rolls over, a 0xFFFF is stored, and the write address is 
advanced. 


11.6.1.4 A Dead Cycle Counter Bug 


(a) Dead Cycle counts are 1 too high. The smallest Dead Cycle count you’ll see stored in a CTB is 2, which 
means 1 non-qualified clock. The largest you’ll see is 0xFFFF, which means 0xFFFE non-qualified clocks. 

(b) After rollover, after storing the 0xFFFF, the Dead Cycle counts stored are 1 too low. The smallest Dead 
Cycle count you’ll see stored in a CTB is 0, which means 1 non-qualified clock. The largest you’ll see is 0xFFFF, 
which means 0x10000 non-qualified clocks. 

These corrections to what you read from a CTB apply to the usual LAC programs you are likely to write, 
where the LAC program has left collection turned on for a medium or long period of time, and the storing of dead 
cycle counts in the CTB is being controlled by the selected qualifier signal turning on and off. If you write a LAC 
program that turns on collection for a short period of time, and qualification is not met during that entire time, 
the stored dead cycle count will be correct. For example if you enabled collection for 5 clocks, and qualification 
was never met, you'd get a “5” stored. 
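
Software that post-processes a CTB dump can apply the two corrections mechanically. The sketch below is one way to do it, assuming the caller already knows whether a given stored count falls under case (a) or case (b); how long case (b) persists after a rollover is not spelled out here, so that decision is left to the caller. The function name is hypothetical, and the short-collection exception described above is not handled.

```python
def actual_dead_cycles(stored: int, after_rollover: bool) -> int:
    """Convert a stored Dead Cycle count into the real number of non-qualified clocks.

    after_rollover: False for case (a), where the stored count is 1 too high;
                    True for case (b), where the stored count is 1 too low.
    """
    if not 0 <= stored <= 0xFFFF:
        raise ValueError("dead cycle counts are 16 bits")
    if after_rollover:
        return stored + 1   # case (b): 0 means 1 clock, 0xFFFF means 0x10000
    return stored - 1       # case (a): 2 means 1 clock, 0xFFFF means 0xFFFE


print(actual_dead_cycles(2, after_rollover=False))
print(hex(actual_dead_cycles(0xFFFF, after_rollover=True)))
```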


11.6.1.5 The Trace RAM 


The Trace RAM is configurable, and is at most 33 bits by 1K entries. In all cases, the width of the RAM is 1 
bit wider than the input sample, to allow recording of “dead cycle” markers. 


11.6.1.6 When Can You Read CTB Contents? 


One of 3 conditions must be true for you to read out the CTB contents with SCB reads: 

(1) Your LAC program has shut OFF CollectTrace. In Ice9A this can only be done by a LAC program 
instruction, no register write can do it, and stopping the LAC program does not do it. In Ice9B and later, stopping 
the LAC program will also shut OFF CollectTrace. 

(2) The CTB in question is in StopOnFull mode, and has become full. 

(3) You clear EnableCollect in the CTB’s R_CtbxColCtl. 
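
The three conditions reduce to a single boolean test, which a debug script might use before trusting a read-out. This is a sketch only; the function and its argument names are made up, and the caller is assumed to know the state of CollectTrace and whether the CTB has filled.

```python
def ctb_contents_readable(collect_trace_on: bool,
                          stop_on_full: bool,
                          ctb_full: bool,
                          enable_collect: bool) -> bool:
    """True if SCB reads of the CTB RAM should return collected data."""
    return (
        (not collect_trace_on)            # condition (1): CollectTrace is OFF
        or (stop_on_full and ctb_full)    # condition (2): StopOnFull CTB has filled
        or (not enable_collect)           # condition (3): EnableCollect cleared
    )


# CollectTrace still ON, rollover mode, EnableCollect set: reads are not valid.
print(ctb_contents_readable(True, False, False, True))
```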

If none of these conditions have been met when you read-out the contents of a CTB, you will get all-zeros! 
This may give you the wrong idea that nothing was collected, or the wrong idea that you triggered and collected 
at a time when no activity was occurring on the signals being collected. To find out if CollectTrace is ON, read 
R_CtbxColCtl in any CTB, and look at bit “Collecting”. 


11.6.1.7 Do You Need To Shut-Off CollectTrace? 


If you will-be or might-be running on Ice9A chips, and if your next use of OCLA has CTBs in StopOnFull mode, 
you probably want to shut-off CollectTrace (if it’s on) before configuring and initializing for that OCLA run. If 
your next use of OCLA has CTBs in rollover mode (StopOnFull==0) then CollectTrace being ON doesn’t matter. 

Methods of shutting-OFF CollectTrace are described later in the OCLA Programming Suggestions section. 

Why would CollectTrace be ON? In an Ice9A chip, the previously-run LAC program left it on, either due to a 
LAC program error, a trigger never occurring, or the LAC program being halted in the middle by a write of GO=0. 


11.6.2 Registers 


For “x” in the register names below, substitute desired collector name, from these: 
Ps0, Ps1, Ps2, Ps3, Ps4, Ps5, Cohe, Coho, Fswi, Fswo, Dma, Pmi. 


11.6.2.1 The Collection Control Register 
Register 
R_CtbxColCtl 


Attributes 


-writeonemixed 


Address 
0x00_0000 (plus base address) 


Bits | Field | Access | Reset | Description 
11:9 | ExtMuxSel | RW | | External Mux Select for logic outside the CTB to select alternate capture input sources. Many units use “7” to disable flops or data to their CTB. (see Note 2, Note 3, Note 4) 
8 | EnableCollect | | | Collect Data when CollectTrace is asserted 
7 | Collecting | R | | Will read as 1 when CollectTrace from LAC is asserted. Does not go to zero as you might expect when StopOnFull==1 and the CTB has become full. Also, it is unaffected by EnableCollect. 
6 | StopOnFull | RW | 0 | Stops collection when WtAddr overflows 
5 | DCtrClr | W1C | | Clear Dead Cycle Counter - OBSOLETE, has no effect. (This definition kept for backward compatibility.) 
4 | WtAddrClr | W1C | | Clear Write Address register. (see Note 5). Twc9 note: This bit should be moved to a different register, and -writeonemixed removed, as W1C mixed with normal write is annoying to SW. 
3:2 | QTrigState | RW | | The values that QualTrigger1 and QualTrigger0 must be for collection, if qualification is enabled. You must leave these bits 0 if not enabling qualification. 
1:0 | QualTrig | | | “Qualification Enable”, with enables for QualTrigger1 and QualTrigger0. 


Note 1: In a given CTB, a collection of values on signals from the unit occurs when 4 things are 
true: (a) R_CtbxColCtl.EnableCollect==1, (b) lac_ctb_CollectTrace_c0a==1 (the “Collect” signal from LAC), (c) 
R_CtbxColCtl.StopOnFull==0 or the collector block is not full yet, (d) “qualification” is currently satisfied. 
“Qualification” = ((QualTrigger-input-0 & QualTrig[0]) == QTrigState[0]) && ((QualTrigger-input-1 & QualTrig[1]) == 
QTrigState[1]). 
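
The four-part condition in Note 1 can be written out directly, which also shows why QTrigState has to stay 0 when qualification is disabled. This is a behavioral sketch with made-up function names, not a description of the Verilog; the two qualifier inputs are passed in as 0 or 1.

```python
def qualification_met(q_in0: int, q_in1: int, qual_trig: int, q_trig_state: int) -> bool:
    """Note 1 formula: each masked qualifier input must equal its QTrigState bit."""
    ok0 = (q_in0 & (qual_trig & 1)) == (q_trig_state & 1)
    ok1 = (q_in1 & ((qual_trig >> 1) & 1)) == ((q_trig_state >> 1) & 1)
    return ok0 and ok1


def collects_this_clock(enable_collect: bool, collect_trace: bool,
                        stop_on_full: bool, full: bool,
                        q_in0: int, q_in1: int,
                        qual_trig: int, q_trig_state: int) -> bool:
    """True when conditions (a) through (d) of Note 1 all hold."""
    return bool(
        enable_collect
        and collect_trace
        and (not stop_on_full or not full)
        and qualification_met(q_in0, q_in1, qual_trig, q_trig_state)
    )


# Qualification disabled (QualTrig=0) and QTrigState=0: always qualified.
print(qualification_met(1, 1, qual_trig=0b00, q_trig_state=0b00))
# Qualification disabled but QTrigState left nonzero: never qualified.
print(qualification_met(1, 1, qual_trig=0b00, q_trig_state=0b01))
```

With an enable bit clear, the masked input is always 0, so a nonzero QTrigState bit can never be matched and the CTB would store nothing but dead cycle counts.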

Note 2: Actually, only COHe and COHo CTBs use the default value of 7 to disable activity (and see Note 
3). Most other units just feed zeros in on the collection data inputs of their CTBs. Unusual cases: In PMI, all 8 
ExtMuxSel settings, 0 - 7, are used for different sets of data to collect, except 5 which feeds zeros. In PSx, the 
lower-2 bits of ExtMuxSel choose between the 3 sets of data that can be collected, so ExtMuxSel settings 4 - 7 
repeat the same choices of data as settings 0 - 3, with 3 and 7 collecting 0’s for data. In Fswi and Fswo CTBs, 
mux settings 0 - 4 select different sets of signals, and mux settings 5, 6, 7 select the same data as muxSel=4. 

Note 3: Due to a minor bug, in COHe or COHo, both the trigger block and collector block must have their 
muxes set to other than 7 to enable the external flops on signals coming into either the trigger block or collector 
block. 

Note 4: Due to Bug 1959, affecting PMI only in Ice9A, the ExtMuxSel field of R_TrbcPmiiTrigCtl must be 
used to select input signals for PMI’s CTB, while the ExtMuxSel field in this register for PMI does nothing. This 
is fixed in Ice9B. 

Note 5:

(a) Some usages of a CTB seem to get the CTB “stuck” when followed by later uses of that CTB,
which then fail to collect. This behavior is not fully characterized. We find that doing 2 writes to this register
works best. Both writes carry your new desired ExtMuxSel, QTrigState, and QualTrig. The first write has WtAddrClr=1
and EnableCollect=0. The second write has EnableCollect=1 and your desired StopOnFull setting.

(b) You probably won’t run into this, but: as described in Bug 2026, which is “Won’t Fix” as of June 2006, any
write to the SCB address range of a specific CTB with bit 4 set in the write data will trigger R_CtbxColCtl.WtAddrClr,
clearing that CTB’s R_CtbxWtAddr. However, since there are no other writable registers in a CTB, software should
not be writing to any SCB address other than R_CtbxColCtl within a CTB.
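The two-write sequence from Note 5(a) might be scripted along these lines. This is a sketch only: the function name is hypothetical, the writes are represented as field dictionaries rather than packed register values, and any field the note does not mention for a given write is simply left out here.

```python
def ctb_rearm_writes(ext_mux_sel, qtrig_state, qual_trig, stop_on_full):
    """Return the two R_CtbxColCtl writes of Note 5(a) as field dictionaries."""
    # Both writes carry the same new ExtMuxSel, QTrigState and QualTrig.
    common = {
        "ExtMuxSel": ext_mux_sel,
        "QTrigState": qtrig_state,
        "QualTrig": qual_trig,
    }
    # First write: clear the write address while collection is disabled.
    first = dict(common, WtAddrClr=1, EnableCollect=0)
    # Second write: enable collection with the desired StopOnFull setting.
    second = dict(common, EnableCollect=1, StopOnFull=stop_on_full)
    return [first, second]
```

Packing these fields into a 32-bit write value is left to the caller, using the bit positions given in the R_CtbxColCtl register table.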


11.6.2.2 The RAM Lowbits
Register 
R_CtbxRamLo[1023:0] 


Address 
0x00_1000-0x00_1fff (plus base address) 


31:0 | LoData | R | Low 32 Bits of Trace RAM (RAMData[31:0])


11.6.2.3 The RAM Highbits
Register 
R_CtbxRamHi[1023:0] 


Address 
0x00_2000-0x00_2fff (plus base address) 


31:0 | HiData | R | Bits of Trace RAM including the dead-cycle-count marker
(RAMData[32,30:0]).
You don’t get to see collected bit-31, but you do get to see the
“dead cycle marker”.


11.6.2.4 The Write Address 
Register 
R_CtbxWtAddr 


Address 
0x00_0010 (plus base address) 


Definition 


WtAddr | R | Current (Next) Write Address. To clear this, write 1 to the WtAddrClr
bit in R_CtbxColCtl. For a CTB in StopOnFull mode,
this will read back as 0 after the CTB has become full.


This is an index into the 1024-entry Collector RAM. After collecting for a while and then stopping,
the last entry collected will be at WtAddr-1, or at index 0x3FF if WtAddr=0.

Can you tell from the value in this register whether collection occurred? If the value is not zero, then some
collection did occur since the last time that bit WtAddrClr in R_CtbxColCtl was written to 1. But if the value
is zero, you can’t tell. If StopOnFull=1 and enough is collected to fill the Collector RAM, then WtAddr will be
back to zero again. If StopOnFull=0 and collection occurs for quite a while then stops, you’ll usually see a non-zero
WtAddr, but there’s 1 chance in 1024 that it will be zero.
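The interpretation of WtAddr described above fits in a couple of lines. This is an illustrative sketch; the function names are invented here, and `None` is used to mean “can’t tell”.

```python
RAM_ENTRIES = 1024  # size of the Collector RAM


def last_entry_index(wt_addr):
    """Index of the most recently collected entry: WtAddr-1, wrapping to 0x3FF."""
    return (wt_addr - 1) % RAM_ENTRIES


def collection_occurred(wt_addr):
    """True if collection definitely occurred since WtAddrClr; None if unknown."""
    # A zero WtAddr is ambiguous: nothing collected, RAM filled in StopOnFull
    # mode, or a wrap-around that happened to land exactly on zero.
    return True if wt_addr != 0 else None
```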


11.6.3 CTB Signals 


CTBs (“Collector Blocks” or “Capture Blocks”) provide samples of the important signals within functional blocks 
of the ICE9 that would be difficult to observe in a running system. The CTBs reside logically within the functional 
blocks of the units they are sampling and are instantiated in or near the unit whose data they will sample, and they 
are clocked by the same clock as the data to be sampled. In the signal names below, I will use “xclk” to represent 
the local clock domain. Each of the CTBs is connected to its own SCB slave unit. 


xxx_ctb_QualTrigger_x0a[1:0] | xclk | In | When the CTB is placed in Qualified Collection mode, these
inputs control whether each sample is recorded or not. They should be tied high if this feature is not used.

lac_ctb_CollectTrace_c0a | cclk | In | The LAC produces a single active-high signal telling all capture
blocks to record data to their ring buffers. The CTB must synchronize the signal to xclk before using it.
All CTBs route CollectTrace through a dual-rank synchronizer.

ctb_xxx_SMuxSel_x1a[2:0] | xclk | Out | Selects from among alternate SampleData inputs. By convention,
a mux select value of 7 indicates that the CTB is not in use, and that external flops related to the
sample signals may have their clocks gated.

xxx_ctb_scbs_id[6:0] | SCB Slave ID
chaini_ctb_dat_r[2:0] | Serial chain SCB input
ctb_chaino_dat_r[2:0] | Serial chain SCB output

11.7 Hints for Using Collector Blocks 


11.7.1 Collecting the Event You Triggered On 


What you trigger on is often what you want to collect and view. If you write your LAC program to branch on
the trigger and then start collecting as fast as possible, you’ll miss the event you want to see by many clocks! This
is because the trigger signal takes several clocks to get through the trigger block, the LAC and your LAC program
take several clocks to respond to a trigger and drive the collect signal, and then the collector block takes a couple of
clocks to start collecting.

The way to capture the triggering event is:


1. Turn on continuous collecting in the collector block, and enable collector-block address wrap-around.


2. Use the trigger in your LAC program to stop collecting, rather than to start collecting. If what you want 
to see is very short, just stop collecting when the trigger occurs. 


3. If what you want to collect is longer than the delays involved with OCLA components, then either: [a] For 
a little extra time, put some extra steps in your LAC program between trigger and stopping collecting, or 
[b] For more extra time, when the trigger occurs start one of the timers, and when the timer overflows stop 
collection. 


4. Find out where in the collector block the collection stopped by reading R_CtbxWtAddr (where “x” is your
CTB name). Then read a desired number of collector block entries leading up to (but not including) that
collector block index, wrapping around from top to bottom of collector block, if needed.
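Step 4 comes down to modular index arithmetic. The sketch below lists which entries to read and where they live; the function names are hypothetical, and the 4-byte stride per entry is inferred from 1024 entries spanning the 0x00_1000-0x00_1fff and 0x00_2000-0x00_2fff ranges rather than stated explicitly.

```python
RAM_ENTRIES = 1024
RAM_LO_OFFSET = 0x1000  # R_CtbxRamLo[0], relative to the CTB base address
RAM_HI_OFFSET = 0x2000  # R_CtbxRamHi[0], relative to the CTB base address


def entries_before_stop(wt_addr, count):
    """Indices of the `count` newest entries, oldest first, ending at WtAddr-1."""
    return [(wt_addr - count + i) % RAM_ENTRIES for i in range(count)]


def entry_addresses(base, index):
    """(RamLo, RamHi) addresses for one entry, assuming a 4-byte stride."""
    return (base + RAM_LO_OFFSET + 4 * index,
            base + RAM_HI_OFFSET + 4 * index)
```

For example, if R_CtbxWtAddr reads back as 2 and you want the last 4 samples, the indices to read are 1022, 1023, 0, 1, in that order.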

Figure 11.3: Vector Comparison Trigger Block


11.8 Vector Trigger Blocks (TRBVs) in general 


This section describes what’s common to all Vector Trigger Blocks. The signals available to trigger-on in each 
individual Vector Trigger Block are described in later sections. 

Each TRBV provides a mechanism for trigger comparison between a 32 bit input vector and up to 32 bits of 
value and mask state to produce a TMatch signal. The TMatch output of the trigger block is synchronous with the 
clock domain of the input data. It is the responsibility of the LAC to resynchronize this signal into the cclk domain. 
The TMatch output is true when (INDat AND Mask) == Value. Since the TMatch output is synchronized to the 
source data clock, it may persist for too short a time to be sampled by the cclk in the LAC. In these cases, the TRB 
is responsible for ensuring that the TMatch/SMatch pulse width is sufficiently wide to be sampled by a cclk. For 
trigger blocks in clock domains that are faster than CCLK, the “PulseStretch” bit in the TrigCtl register should be 
set to guarantee that any trigger match pulse is at least two clock cycles long. PulseStretch can also make it easier
to get events from two different trigger blocks to coincide. Each TRBV has two match outputs. (See Figure 11.3.) 
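The vector comparison itself is simple enough to model directly. The following sketch implements TMatch = ((INDat AND Mask) == Value) for 32-bit values; the function name is invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def tmatch(in_dat, mask, value):
    """Model of the TRBV comparison: true when (INDat AND Mask) == Value."""
    return ((in_dat & mask) & MASK32) == (value & MASK32)
```

One consequence worth noticing: with a mask of zero the masked input is always zero, so any nonzero Value (0xFFFFFFFF is used below purely as an example) can never match.

```python
assert tmatch(0x12345678, 0x0000FF00, 0x00005600)      # bits 15:8 equal 0x56
assert not tmatch(0x12345678, 0x00000000, 0xFFFFFFFF)  # zero mask, nonzero value
```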


11.8.1 SCB Performance Counter Connections


In addition to providing triggers to the central OCLA LAC, each TRBV provides each of the 32 bits of
SampleDataIn[31:0] to SCB Performance Counters as events to count. The SCB Performance Counters mechanism can
focus on just 2 signals from SampleDataIn[31:0], or it can sweep across several selections of those 32 signals.

As described in the Serial Configuration Bus chapter, program the SubChipID (from the Addressing chapter)
for the desired TRBV into bits 14:8 of a R_ScbPerfBuckets[255:0] “event” field. Bits 7:5 must be zero, and bits 4:0
are the bit number in SampleDataIn[31:0].
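As a sketch of that encoding, the helper below builds the “event” field value for one SampleDataIn bit. The function name is hypothetical, and the SubChipID in the usage line is a made-up value; take real SubChipIDs from the Addressing chapter.

```python
def trbv_perf_event(sub_chip_id, bit_number):
    """Build a TRBV "event" value: SubChipID in 14:8, zeros in 7:5, bit number in 4:0."""
    if not 0 <= sub_chip_id < 128:
        raise ValueError("SubChipID must fit in 7 bits")
    if not 0 <= bit_number < 32:
        raise ValueError("bit number must be 0..31")
    return (sub_chip_id << 8) | bit_number
```

For instance, `trbv_perf_event(0x12, 5)` gives 0x1205, with bits 7:5 left at zero as required.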

What if you want to count how often some or all of SampleDataIn[31:0] matches a pattern? This can be done
for OCLA triggering purposes by the TRBV, but the SCB Performance Counters hookup to a TRBV is limited
to just 2 bits of SampleDataIn. You can count pattern matches by getting your events to count from LAC rather
than directly from TRBV. LAC gives SCB Performance Counters the trigger outputs from all Trigger Blocks.

How much does this limit simultaneous use of a TRBV for OCLA? Very little. A separate pair of muxes is 
provided for this purpose, so all of the internals of the TRBV in question can be configured as needed for OCLA. 
Only the external mux must be the same for both purposes. 

TRBV events sent to SCB Performance Counters are not stretched by R_TrbvxTrigCtl.PulseStretch. SCB
Performance Counters has its own way to get the correct number of counts even if it’s in a different clock domain
from the TRBV.

The hardware wiring of these signals to SCB Performance Counters is accomplished by feeding them into the 
SCB slave embedded in the TRBV. 


11.8.2 Registers 


For “x” in the register names below, substitute desired vector trigger block name, from these: 
Fswi, Fswo, Dma. 


11.8.2.1 The Trigger Control Register 
Register 


R_TrbvxTrigCtl 


Address 


0x00_0000 (plus base address) 


Bits | Mnemonic | Access | Reset | Definition
5 | PulseStretch | RW | | If set, all matches will be “repeated” in the xclk tic after the match was detected.
4:2 | ExtMuxSel | RW | 7 | External Mux Select for logic outside the TRBV to select alternate trigger input sources. (see Note 1)
1 | QTrigState | RW | 0 | If QualTrig, then this is the value that W1[0] must match.
0 | QualTrig | RW | | Enable qualification of trigger by W1[0] for both trigger0 and trigger1.


Note 1: Power conservation: The default mux select value of 7 indicates that the trigger block is not in use, 
and that external flops related to the sample signals may have their clocks gated. Of course, you'll be writing a 
value other than 7 in this field when you use any TRBV, because all instances of TRBVs have external muxes, and 
in no case does the value 7 select any input trigger sources. 


11.8.2.2 The Trigger Mask Registers 
Register 


R_TrbvxTrigMask[1:0]
R_TrbvxTrigMask[0] controls Match0, R_TrbvxTrigMask[1] controls Match1.




Address 
0x00_0010-0x00_0017 (plus base address)


Bits | Mnemonic | Access | Reset | Type | Definition
31:0 | Mask | RW | 0 | | Selects which bits from SampleDataIn must match.


11.8.2.3 The Trigger Match Registers 


Register 


R_TrbvxTrigMatch[1:0]
R_TrbvxTrigMatch[0] controls Match0, R_TrbvxTrigMatch[1] controls Match1.


Address 
0x00_0020-0x00_0027 (plus base address)


31:0 | Match | RW | 0xffffffff | The value that SampleDataIn must be, after masking by the
above register, to cause the trigger.
Defaults to a nonzero value so that with a mask of zero, the
match always fails until configured.


11.8.3 TRBV Signals


Trigger blocks (TRB) are instantiated in or near the unit whose data they will sample, and they are clocked by 
the same clock as the data to be sampled. In the signal names below, I will use “xclk” to represent the local clock 
domain. Each of the TRBs is connected to its own SCB slave unit. 


Description
Active-low reset, which deasserts synchronous with xclk.

xxx_trbv_SampleDataIn_x0a[31:0] | xclk | In | Data to be sampled. These signals are also connected to the
event wires of the local SCB slave. See “W0[31:0]” for your selected Trigger Mux value, in the later sections on
each vector trigger block.

xxx_trb_CodeValid_x0a | “Code valid flag” used as input to the Qualifier

trbv_lac_Match_x2a[1:0] | xclk | Out | The trigger block asserts each of these signals when the vector
comparison against their respective mask/match registers is true and the Qualifier is satisfied. Asserted for two
successive xclk tics if PulseStretch is set.

trbv_xxx_SMuxSel_x1a[2:0] | xclk | Out | Selects from among alternate SampleData inputs. By convention,
a mux select value of 7 indicates that the TRB is not in use, and that external flops related to the
sample signals may have their clocks gated.

xxx_trbv_scbs_id[6:0] | SCB Slave ID
chaini_scbs_dat_r[2:0] | Serial chain SCB input
scbs_chaino_dat_r[2:0] | Serial chain SCB output


11.9 Codeword Trigger Blocks (TRBCs) in general 


This section describes what’s common to all Codeword Trigger Blocks. The signals available to trigger-on in 
each individual Codeword Trigger Block are described in later sections. 

Each Codeword TRB provides a mechanism for trigger comparison between up to four five-bit codewords and up
to three lists of “interesting” codes. For instance, the TRBC (shown in Figure 11.4) can be used to detect any READ
operation directed at the COHE from a non-processor source by connecting the CodeSample0 input to the COHE’s
command input, and the CodeSample1 input to the TID input. (These connections are statically established.) We’d
then load a 32-bit vector into Table0 with a 1 in each position corresponding to the code for a CSW Read operation.
We’d load Table1 with a vector selecting all TID codes that come from the DMA or PCI/BBS widgets. Assuming
the Qualifier condition is satisfied (see below), the CodeMatch output would be equal to 3 each time a read from a
non-processor widget arrived at COHE.

Figure 11.4: Codeword Trigger Block
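The “1 in each position corresponding to the code” idea is a one-hot lookup table indexed by the five-bit codeword. The sketch below shows only that table construction and lookup, not how the table results are combined into CodeMatch; the function names are invented, and the code values in the usage lines are placeholders, not real CSW command or TID encodings.

```python
def make_table(codes):
    """Build a 32-bit table with a 1 at each "interesting" five-bit code."""
    table = 0
    for code in codes:
        if not 0 <= code < 32:
            raise ValueError("codewords are five bits wide")
        table |= 1 << code
    return table


def table_hit(table, code):
    """Return 1 if the sampled codeword selects a set bit in the table, else 0."""
    return (table >> code) & 1
```

Usage, with placeholder codes 3 and 17 standing in for the codes of interest:

```python
table0 = make_table([3, 17])
assert table_hit(table0, 17) == 1
assert table_hit(table0, 4) == 0
```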

CodeMatch may be qualified by looking at one or both of the “CodeValid” inputs. The control register selects
which (or both) of the CodeValid inputs are examined and what state they must be in to allow a match.

Note that any of the three tables can be “examined” by any of three of the four code inputs. This allows 
triggering on events such that the Code match word could be set up (for example) to produce 1 for READs, 2 for 
WRITEs, and 3 for RETRIEs. 

The CodeMatch output of the trigger block is synchronous with the clock domain of the input data. It is the 
responsibility of the LAC to resynchronize this signal into the cclk domain. The TMatch output is true when 
(INDat AND Mask) == Value. Since the TMatch output is synchronized to the source data clock, it may persist 
for too short a time to be sampled by the cclk in the LAC. In these cases, the TRB is responsible for ensuring that 
the TMatch/SMatch pulse width is sufficiently wide to be sampled by a cclk. For trigger blocks in clock domains 
that are faster than CCLK, the “PulseStretch” bit in the TrigCtl register should be set to guarantee that any trigger 
match pulse is at least two clock cycles long. PulseStretch can also make it easier to get events from two different
trigger blocks to coincide. 

Both bits of the CodeMatch output from the TRB are connected to the central LAC and to the x_scbs_event[1:0]
inputs of the associated SCB slave unit.


11.9.1 SCB Performance Counter Connections 


Each TRBC provides its output triggers CodeMatch0 and CodeMatch1 to SCB Performance Counters as events 
that can be counted. 

As described in the Serial Configuration Bus chapter, program the SubChipID for the desired TRBC (from the 
Addressing chapter) into bits 14:8 of a R_ScbPerfBuckets “event” field. Bits 7:0 of “event” are don’t-cares. 

TRBCs in a faster clock domain may need to use R_TrbcxTrigCtl.PulseStretch when sending triggers to LAC,
but there’s no need to PulseStretch when providing events to SCB Performance Counters. SCB Performance 
Counters will get the correct number of counts even if it’s in a different clock domain from the TRBC. If you DO 
set PulseStretch, which you might want to if LAC needs the signals too, then SCB Performance Counters will get 
a much higher incorrect count. Note that there’s only one PulseStretch bit, controlling both outputs. 

How much does this limit simultaneous use of a TRBC for OCLA? If both CodeMatch0 and CodeMatch1 in 
a particular TRBC are used by SCB Performance Counters, then OCLA can only use that TRBC if it can use it 
with the exact same configurations. If only one CodeMatch is used by Performance Counters, then the other one 
can be configured as needed for OCLA, although the external mux and some of the internal muxes will have to be 
the same for both Performance Counters and OCLA. You can freely apply delays to these triggers within LAC, 
with no effect on them going to Performance Counters. 

The hardware wiring of CodeMatch0 and CodeMatch1 to SCB Performance Counters is accomplished by wiring 
them to the embedded SCB slave within the TRBC. This is independent from the pathway by which LAC provides 
all of its trigger-block triggers to SCB Performance Counters. 


11.9.2 Registers 
For “x” in the register names below, substitute the desired codeword trigger block name, from these:
Ps0, Ps1, Ps2, Ps3, Ps4, Ps5, Cohe, Coho, Fsw, Dma, Pmi, Pmii.

11.9.2.1 The Trigger Control Register 

Register 
R_TrbcxTrigCtl


Address 
0x00_0000 (plus base address) 


Bits | Mnemonic | Access | Reset | Definition
19 | PulseStretch | RW | | If set, all matches will be “repeated” in the xclk tic after the match was detected.
18:16 | ExtMuxSel | RW | 7 | External Mux Select allows choice between multiple sets of
trigger inputs feeding the same TRBC. (see Note 1) (see Note
2) (see Note 3)


aaa | wrasat__ RW [0 |__| Max S input Select SSCS 
Pisa: wast Pw 0 2 tap Steet 
SS 


Mux0Sel | RW | 0 | Mux 0 Input Select

QTMatch1 | Qual[1] = (CodeValid[1:0] & QTMask1[1:0]) == QTMatch1[1:0]

5:4 | QTMask1 | RW | Enable Qualified Trigger mode for CodeValid 0 or 1 or both

QTMatch0 | Qual[0] = (CodeValid[1:0] & QTMask0[1:0]) == QTMatch0[1:0]

QTMask0 | RW | 0 | Enable Qualified Trigger mode for CodeValid 0 or 1 or both


Note: QTMatch1 and QTMask1 affect CodeMatch1; QTMatch0 and QTMask0 affect CodeMatch0.
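
The qualifier expressions in the table can be checked with a small model. This sketch evaluates Qual[n] = ((CodeValid[1:0] & QTMaskn[1:0]) == QTMatchn[1:0]) for one output; the function name is invented for illustration.

```python
def qual(code_valid, qt_mask, qt_match):
    """Return 1 if the two CodeValid bits satisfy the mask/match pair, else 0."""
    return int((code_valid & qt_mask & 0b11) == (qt_match & 0b11))
```

With QTMask=0 and QTMatch=0 the qualifier is always satisfied, whatever CodeValid does. Setting QTMask=0b01 and QTMatch=0b01 requires CodeValid0 to be 1 and ignores CodeValid1.

```python
assert qual(0b10, 0b00, 0b00) == 1  # qualification effectively disabled
assert qual(0b01, 0b01, 0b01) == 1  # CodeValid0 required high, and it is
assert qual(0b10, 0b01, 0b01) == 0  # CodeValid0 required high, but it is low
```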
Note 1: Power-saving: In most TRBC instantiations, where more than one set of trigger inputs is selected by
ExtMuxSel, the default value of 7 indicates that the trigger block is not in use, and that external flops related to
the sample signals may have their clocks gated. Exceptions to this are the TRBCs in DMA and PMI which have
only one set of input triggers, where the default value of 7 has no special meaning.

Note 2: Due to a minor bug, in COHe or COHo, both the trigger block and collector block must have their
muxes set to other than 7 to enable the external flops on signals coming into either the trigger block or collector
block.

Note 3: Due to Bug 1959, affecting PMI only in Ice9A, the ExtMuxSel field of R_TrbcPmiiTrigCtl must be
used to select input signals for PMI’s CTB, while the ExtMuxSel field in R_CtbPmiColCtl does nothing. This is
fixed in Ice9B.


11.9.2.2 The Trigger Table Registers 
Register 
R_TrbcxTrigTab[3:0]


Address 
0x00_0010-0x001F (plus base address) 


Definition 
TTable | Trigger Pattern for this Table


11.9.2.3 The Qualifier Table Registers
Register
R_TrbcxQualTab[1:0]


Address 
0x00_0020-0x0027 (plus base address) 


QTable | RW | 0 | Trigger Pattern for this Table


11.9.3 TRBC Signals


Trigger blocks (TRB) are instantiated in or near the unit whose data they will sample, and they are clocked by
the same clock as the data to be sampled. In the signal names below, “xclk” represents the local clock domain.
Each of the TRBs is connected to its own SCB slave unit.


In | Active-low reset, which deasserts synchronous with xclk.

xxx_trbc_CodeSamp3_x0a[4:0] | In | Codeword3 to be tested. These signals are also connected to
the event wires of the local SCB slave.
xxx_trbc_CodeSamp2_x0a[4:0] | In | Codeword2 to be tested. These signals are also connected to
the event wires of the local SCB slave.
xxx_trbc_CodeSamp1_x0a[4:0] | In | Codeword1 to be tested. These signals are also connected to
the event wires of the local SCB slave.
xxx_trbc_CodeSamp0_x0a[4:0] | In | Codeword0 to be tested. These signals are also connected to
the event wires of the local SCB slave.

In | One of two “code valid flags” used as input to the Qualifier.
In | One of two “code valid flags” used as input to the Qualifier.

trbc_lac_CodeMatch_x2a[1:0] | Out | Each of these two bits is the selected bit from the corresponding
QTable ANDed with the respective Qualifier bits. Asserted
for two successive xclk tics if PulseStretch is set.

trbc_xxx_SMuxSel_x1a[2:0] | Selects from among alternate SampleData inputs. By convention,
a mux select value of 7 indicates that the TRB is not in
use, and that external flops related to the sample signals may
have their clocks gated.

xxx_trbc_scbs_id[6:0] | SCB Slave ID
chaini_scbs_dat_r[2:0] | Serial chain SCB input
scbs_chaino_dat_r[2:0] | Serial chain SCB output


11.10 Hints for Using Trigger Blocks 


11.10.1 Using CodeValid Signals 


The “CodeValid” or “Qualifier” signals hooked up as inputs to most Vector and Codeword Trigger Blocks were
conceived of as a final “yes/no” on whatever other signals you’ve configured (by SCB) your Trigger Block to respond
to. Unlike the other signals available for triggering, these are configured using bits in the main control register for
your Trigger Block (the 2 Trig bits in R_TrbvxTrigCtl, or the 8 QT bits in R_TrbcxTrigCtl). But in use they’re not
really all that different from the other trigger inputs. Any Trigger Block input signal can effectively say “yes/no”
on the overall trigger output from that Block. In a Collector Block, qualifiers play a very special role, but in a
Trigger Block they’re just one more signal which you can AND into the expression for one or both of the trigger
outputs. They’re just programmed differently.


11.10.2 Trigger Clock Domains 


Almost all of OCLA operates in cclk, including LAC and most Trigger and Collector Blocks. The only exception
is that FSW Trigger and Collector Blocks are in the sclk domain. sclk will always be slower than or the same frequency as
cclk. No phase relationship is guaranteed between sclk and cclk, even when at the same frequency. Furthermore,
when at the same frequency, there’s a very small probability that on a signal going from sclk to cclk, a
one-sclk-long pulse may not be seen at all in the cclk domain, due to variations over time in which cclk edge the
clock-synchronization logic chooses to present a newly changing sclk-domain signal. If Ice9 is operating with cclk
faster than sclk this never happens, but you see occasional stretching of 1-sclk pulses from the sclk domain becoming
2 cclks long in the cclk domain.

Since triggers from Trigger Blocks are often 1 clock long, the loss of such a trigger pulse going from an FSW 
Trigger Block to the LAC would be a problem. The PulseStretch feature of Trigger Blocks provides a solution, 
making the trigger pulse 2 sclks long, which is sure to become at least one cclk long at the LAC. 

FSW Trigger Blocks being in a different clock domain from LAC causes another problem. The delay regs in
LAC cannot be used for FSW triggers as easily or reliably as they can for the other Trigger Blocks.


11.10.3 Uses for the Delay Registers 

The LAC has separate delay registers for each trigger signal coming from each Trigger Block. Here are some
uses for them:


11.10.3.1 Aligning Mis-Aligned Signals From Same Trigger Block


Often you want to trigger on a combination of signals from a trigger block that, while related to the same
event, happen on different clocks, as when one of the signals asserts 1 or 2 clocks later than the others. Use the
2 trigger lines from that trigger block, one for each signal, then delay one of them in LAC. Either or both trigger
lines could be from ANDed groups of signals.


11.10.3.2 Aligning CodeValid or Qualifier with Other Triggers in a Trigger Block 


Line-up a signal or group of signals from a trigger block with the qualifier of that trigger block, if they differ 
by 1 or more clocks. Use one trigger line for the group of signals, unqualified. Use the other trigger line for the 
qualifier, qualifying “true” (mask=0, match=0). 


11.10.3.3 Aligning Triggers from Different Trigger Blocks 


This can compensate for one Trigger Block having more flops than the other between trigger signal source and
LAC. This can also adjust for an event in one Trigger Block occurring earlier than the related event in the other Trigger
Block.

If the difference in time between these two triggers is too large for LAC’s Delay Registers, you might be able to
write your LAC program to wait for the first event, then wait for the 2nd, with a timeout at which point it goes back to
waiting for the first event. Of course this only works if “first events” are separated by enough clocks.


11.10.3.4 Provide Bigger Window for Coinciding Events 


Combined with PulseStretch, provide a wider window of “coinciding” between single-clock events from different
trigger blocks, up to 7 cclks wide! To do this, enable PulseStretch on both trigger blocks, and then send the
same trigger out both trigger ports of each trigger block. In the delay registers, skew the 2 triggers from a given
trigger block by 2 cclks relative to each other, providing a “trigger == true” time of 4 cclks from each trigger block.
Use 4 Aggregate Matches to “and” each trigger from one trigger block with each trigger from the other trigger
block. Then, in your LAC program, loop waiting to branch on any of these 4 Aggregate Matches to the same one
“got the event” LAC state.
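
Where does the figure of 7 cclks come from? The sketch below models the scheme in idealized cclk ticks: each stretched trigger is true for 2 ticks, each block sends two copies skewed by 2 ticks, and the events “coincide” if any of the 4 Aggregate Matches fires. The function names are invented, and synchronizer uncertainty is ignored.

```python
def pulse(start, width=2):
    """Set of cclk ticks during which one stretched trigger is true."""
    return set(range(start, start + width))


def block_triggers(event_time, skew=2):
    """The two trigger outputs of one block: same event, second delayed by `skew`."""
    return [pulse(event_time), pulse(event_time + skew)]


def coincides(time_a, time_b):
    """True if any of the 4 Aggregate Matches (A_i AND B_j) is true on some tick."""
    return any(a & b
               for a in block_triggers(time_a)
               for b in block_triggers(time_b))


def coinciding_offsets(search=8):
    """Offsets of event B relative to event A that still count as coinciding."""
    return [d for d in range(-search, search + 1) if coincides(0, d)]
```

In this model each block’s combined “true” time is 4 ticks, and two 4-tick windows overlap for offsets of -3 through +3, which is 7 distinct alignments.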


11.11 OCLA in use — PSx (Processor Segments) 


The 6 Processor Segments have 1 Trigger Block each, and 1 Collector Block each. For “x” in “PSx” substitute 
each of 0,1,2,3,4,5. 


11.11.0.5 Location of OCLA-PSx Blocks and Signals 


PSx signals for OCLA triggering and collection are in the CAC part of each PSx. 

From a usage point of view you don’t need to know where the Trigger and Collector Blocks of OCLA-PSx are
located, but if you are looking at the Verilog code, you might get confused, so here’s the info: The Trigger Block
for each PSx is located in its CAC, but the Collector Block is located in one of the COH units. COHe contains
3 of the PSx Collector Blocks, and COHo contains the other 3. These 3 are not to be confused with COH’s own
Collector Blocks, which are connected to COH signals. Each of COHe and COHo contains one COH collector block
and 3 PSx collector blocks.


11.11.1 PSx Triggers 


Each of the Processor Segments will have a codeword trigger capable of detecting events coming from the CSW,
and internal L2 controller state. We want to watch lots more signals than we have inputs for a TRBC, so we
provide an external mux to select between trigger sources that are hopefully not both interesting at the same
time. The following tables define the codeword triggers for the most interesting signals and signal combinations in
the ICE9 Cache. For the cache unit there are four mux-selectable groupings of codeword triggers. Each class below
represents one of the four mux-selectable groupings. Note that all signals listed are flopped once before entering
trigger blocks.


11.11.1.1 Processor Segment Trigger Mux 0 
Class 
TrbcPsxMux0 


Attributes 


-ocla -trbc -trbcpsx

Bit | Mnemonic | (Codeword Sample Input) | (Signal) | Definition

W3[2] | CswToPsxCmdAddrGnt | xxx_trbc_CodeSamp3Mux0[2] | cac.csw_psx_CmdAddrGnt_c1a | We got the last command cycle. cclk after psx_csw_ECmdAddrReq_c0a, psx_csw_OCmdAddrReq_c0a
W3[1:0] | | xxx_trbc_CodeSamp3Mux0[1:0] | | Reserved, always zero
W4[0] | Cv0LatCmdAddrValid | xxx_trbc_CodeValid0Mux0_x0a | cac.lat_xxx_CmdAddrValid_c2a | CSW is sending a command to PSX; same cclk as lat_xxx_Command_c2a, lat_xxx_CmdAddrTID_c2a
W5[0] | Cv1LatDataValid | xxx_trbc_CodeValid1Mux0_x0a | cac.lat_xxx_DataValid_c4a | Incoming Data-Valid from CSW

11.11.1.2 Processor Segment Trigger Mux 1 
Class 
TrbcPsxMux1 


Attributes 


-ocla -trbc -trbcpsx

Bit | Mnemonic | (Codeword Sample Input) | (Signal) | Definition

W0[1] | CtlToSlcWinPrb | xxx_trbc_CodeSamp0Mux1[1] | cac.ctl_slc_WinPrb_c6a | This is a probe to L1 in response to a PRB-
W0[0] | CtlToSlcInvPrb | xxx_trbc_CodeSamp0Mux1[0] | cac.ctl_slc_InvPrb_c6a | This is a probe to L1 in response to a
W1[4:0] | LatCmdAddrTid | xxx_trbc_CodeSamp1Mux1[4:0] | cac.lat_xxx_CmdAddrTID_c2a[4:0] | Transaction ID for incoming request from CSW
W2[4:0] | CtlToLamPrbQState | xxx_trbc_CodeSamp2Mux1[4:0] | cac.ctl_lam_PrbQState_c4a[4:0] | Probe-queue handler state
W3[4:3] | SlcPrbDirty | xxx_trbc_CodeSamp3Mux1[4:3] | cac.slc_xxx_PrbDirty_cya[1:0] | Which of two 32 byte blocks in a probe were
W5[0] | Cv1SlcToTagBiuMemAcc | xxx_trbc_CodeValid1Mux1_x0a | cac.slc_tag_BiuMemAcc_cya | CPU to CAC request address is a memory access


11.11.1.3 Processor Segment Trigger Mux 2
Class 


TrbcPsxMux2 


Attributes 


-ocla -trbc -trbcpsx

Bit | Mnemonic | (Codeword Sample Input) | (Signal) | Definition

W0[4:0] | | xxx_trbc_CodeSamp0Mux2[4:0] | cac.lat_xxx_Command_c2a[4:0] | Command code for incoming request from CSW
W1[4:0] | LatCmdAddrTid | xxx_trbc_CodeSamp1Mux2[4:0] | cac.lat_xxx_CmdAddrTID_c2a[4:0] | Transaction ID for incoming request from CSW
W2[4] | SlcToTagBiuWrite | xxx_trbc_CodeSamp2Mux2[4] | cac.slc_tag_BiuWrite_cya | CPU to CAC request of a write, mem or IO
W2[3] | SlcToDatPrbWbVal | xxx_trbc_CodeSamp2Mux2[3] | cac.slc_dat_PrbWbVal_cya | Data in cz is a writeback from a probe
W2 | SlcBiuPaused | | cac.slc_xxx_BiuPaused_c2b | Says SLC won’t send new requests until pause deasserts
 | | | cac.slc_xxx_PrbDone_cya[1:0] | Probe for both blocks has completed
 | | | cac.ctl_dat_PrbRdReq_c5a | Read a block out of the L2 and write it to
 | | | [See Note 1] | ctl_dat_WtPrb2L2_c5a ORed with ctl_dat_PrbState_c5a[2]
 | | | cac.ctl_dat_PrbState_c5a[1:0] |


Notes: 


1. cac.ctl_dat_WtPrb2L2_c5a || cac.ctl_dat_PrbState_c5a[2] in Ice9A. This was a mistake, bug 1995, which makes it hard
to trigger on all 3 bits of ctl_dat_PrbState_c5a[2:0]. With only the lower 2 bits we can distinguish between four Cac
State possibilities: 0=INV, 1=EXCL, 2=SHARE-or-DIRTY, 3=UPDATED. Signal ctl_dat_WtPrb2L2_c5a means: for
BRD writebacks to CSW, also write data to L2. Fixed in Ice9B to be just cac.ctl_dat_PrbState_c5a[2], allowing
triggering on all Cac States.


2. When the probe data is sent along, ctl_dat_PrbState_c5a is the state that should be propagated (all 3 bits, that is).
PrbState is of type CacState, not CacPrbQState.


11.11.1.4 Processor Segment Trigger Mux 3 
Class 
TrbcPsxMux3 


Attributes 


-ocla -trbc -trbcpsx

Bit | Mnemonic | (Codeword Sample Input) | (Signal) | Definition

W0[4:0] | PsxToCswCmd | xxx_trbc_CodeSamp0Mux3[4:0] | cac.psx_csw_Command_c0a[4:0] | Processor Segment to CSW Command
W1[4] | CtlToTagInvReq | xxx_trbc_CodeSamp1Mux3[4] | cac.ctl_tag_InvReq_c5a | Invalidate Request, reqAddr block should
W1[3] | CtlToTagWinReq | xxx_trbc_CodeSamp1Mux3[3] | cac.ctl_tag_WinReq_c5a | In Biu pause, doing writeback & invalidate
W1[2] | CtlToTagBrdReq | xxx_trbc_CodeSamp1Mux3[2] | cac.ctl_tag_BrdReq_c5a | In Biu pause, doing block-read for PRB-
W1[1] | CtlToTagBwtReq | xxx_trbc_CodeSamp1Mux3[1] | cac.ctl_tag_BwtReq_c5a | In Biu pause, doing block-write for
W1[0] | CtlToTagShrReq | xxx_trbc_CodeSamp1Mux3[0] | cac.ctl_tag_ShrReq_c5a | In Biu pause, going to shared state for
W2[4] | SlcToTagBiuWrite | xxx_trbc_CodeSamp2Mux3[4] | cac.slc_tag_BiuWrite_cya | CPU to CAC request of a write, mem or IO
W2[3] | SlcToTagBiuRead | xxx_trbc_CodeSamp2Mux3[3] | cac.slc_tag_BiuRead_cya | CPU to CAC request of a read, mem or IO
W2[2] | SlcToDatPrbWbVal | xxx_trbc_CodeSamp2Mux3[2] | cac.slc_dat_PrbWbVal_cya | Data in cz is a writeback from a probe
W2[1] | SlcToTagBiuMemAcc | xxx_trbc_CodeSamp2Mux3[1] | cac.slc_tag_BiuMemAcc_cya | CPU to CAC request address is a memory access
W2[0] | SlcToTagIFetch | xxx_trbc_CodeSamp2Mux3[0] | cac.slc_tag_IFetch_cya | Instruction stream Fetch
W3[4] | TagToCtlW0Miss | xxx_trbc_CodeSamp3Mux3[4] | cac.tag_ctl_W0Miss_cza | (Tag-Miss on Way-0, or Idle) and not IO-access [See Note 2]
W3[3] | TagToCtlW1Miss | xxx_trbc_CodeSamp3Mux3[3] | cac.tag_ctl_W1Miss_cza | (Tag-Miss on Way-1, or Idle) and not IO-access [See Note 2]
W3[2] | TagToCtlPrbHit | xxx_trbc_CodeSamp3Mux3[2] | cac.tag_ctl_PrbHit_c6a | The incoming probe op hit on the L2
W3[1:0] | TagToCtlBlkState | xxx_trbc_CodeSamp3Mux3[1:0] | cac.tag_ctl_BlkState_cza[1:0] | State of block we got a hit on
W4[0] | Cv0SlcToTagBiuMemAcc | xxx_trbc_CodeValid0Mux3_x0a | cac.slc_tag_BiuMemAcc_cya | CPU to CAC request address is a memory access
W5[0] | Cv1PsxToCswXCmdAddrReq | xxx_trbc_CodeValid1Mux3_x0a | [See Note 1] | PSX to CSW Even or Odd Cmd Address


Notes: 


1. cac.psx_csw_ECmdAddrReq_c0a || cac.psx_csw_OCmdAddrReq_c0a; // Request by Cac for either the evenbound
or oddbound CSW Cmd Address Bus.


2. Bug2243: In Ice9A each of these “W0Miss, W1Miss” signals will be asserted when their “way” (W0 or W1) has
a tag-miss on a Biu Memory Access, or anytime accessing tags is idle. This means they’re similar to “~Hit”
signals, except that for processor IO accesses, both of these will be 0 (which does not mean “Hit”). This is
because tags are bypassed during IO accesses. To eliminate both Idles and IO-accesses, configure OCLA so
that slc_tag_BiuMemAcc_cya must be true when looking for W0Miss or W1Miss to be either true or false.
These trigger bits are improved in Ice9B to be tag_ctl_W0Hit and tag_ctl_W1Hit.


3. Bug2243: In Ice9A signals tag_ctl_W0Miss_cza and tag_ctl_W1Miss_cza are 1 cclk later than the other
related signals provided, for a given access event. This means that to condition W0Miss or W1Miss with
another signal you'll have to use both codeword trigger outputs, and then in LAC delay one relative to the
other. This is fixed in Ice9B.


11.11.2 PSx Collectors


Each of the six PS CTBs contains the following mux inputs and signals.


11.11.2.1 PSx Input Collectors Qualifying Triggers 
Class 
CtbPsxQtrig 


Attributes 


-ocla -ctb -ctbpsx 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
1 | LatCmdAddrValid | xxx_ctb_QualTrig1_x0a | cac.lat_xxx_CmdAddrValid_c2a | CSW is sending a command to PSX
0 | SlcToTagOp | xxx_ctb_QualTrig0_x0a | [See_Note_1] | CPU to CAC request of a read or write, mem or IO


Notes: 


1. cac.slc_tag_BiuRead_cya || cac.slc_tag_BiuWrite_cya; 


11.11.2.2 PSx Input Collector Mux 0 
Class 
CtbPsxMux0 


Attributes 


-ocla -ctb -ctbcac 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
31 | TagToCtlW1Miss | xxx_ctb_SampleDataIn0_x0a[31] | cac.tag_ctl_W1Miss_cza | (Tag-Miss on Way-1, or Idle) and not IO-access
30 | TagToCtlW0Miss | xxx_ctb_SampleDataIn0_x0a[30] | cac.tag_ctl_W0Miss_cza | (Tag-Miss on Way-0, or Idle) and not IO-access
29:25 | CtlToLamPrbQState | | | Probe-queue handler state
24:21 | SlcToLamRdyState1 | xxx_ctb_SampleDataIn0_x0a[24:21] | cac.slc_lam_RdyState1_c2a[3:0] | Ready state from the SLC, pclk number 1 [See Note 1]
20 | CswToPsxDataGnt | xxx_ctb_SampleDataIn0_x0a[20] | cac.csw_psx_DataGnt_c3a | Cache switch to processor segment data grant
19 | PsxToCswODataReq | xxx_ctb_SampleDataIn0_x0a[19] | cac.psx_csw_ODataReq_c2a | Processor segment to cache switch odd data request
18 | PsxToCswEDataReq | xxx_ctb_SampleDataIn0_x0a[18] | cac.psx_csw_EDataReq_c2a | Processor segment to cache switch even data request
17 | CswToPsxCmdAddrGnt | xxx_ctb_SampleDataIn0_x0a[17] | cac.csw_psx_CmdAddrGnt_c1a | Cache switch to processor segment command grant
16 | PsxToCswOCmdAddrReq | xxx_ctb_SampleDataIn0_x0a[16] | cac.psx_csw_OCmdAddrReq_c0a | Processor segment to cache switch odd command request
15 | PsxToCswECmdAddrReq | xxx_ctb_SampleDataIn0_x0a[15] | cac.psx_csw_ECmdAddrReq_c0a | Processor segment to cache switch even command request
14 | Always0 | xxx_ctb_SampleDataIn0_x0a[14] | [Always_Zero] | Reserved
13:10 | SlcToLamRdyState0 | xxx_ctb_SampleDataIn0_x0a[13:10] | cac.slc_lam_RdyState0_c2a[3:0] | Ready state from the SLC, pclk number 0 [See Note 1]
9:5 | LatCmdAddrTid | xxx_ctb_SampleDataIn0_x0a[9:5] | cac.lat_xxx_CmdAddrTID_c2a[4:0] | Command TID
4:0 | LatCmd | xxx_ctb_SampleDataIn0_x0a[4:0] | cac.lat_xxx_Command_c2a[4:0] | Command


Notes: 


1. The CPU runs on pclk, twice as fast as cclk, so for OCLA (in cclk) to see the sequence of ready states in the
CPU, 2 successive pclk states are passed into Cac and into this collector block on each cclk. See RdyState1 in
collector bits 24:21, and RdyState0 in collector bits 13:10. RdyState0 occurred in the CPU before RdyState1.
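
As a minimal sketch of how one 32-bit sample from this collector might be unpacked on a host after collection, the helper below pulls out the two ready-state fields in the order they occurred in the CPU. The function name and the idea of post-processing samples in Python are assumptions for illustration; only the bit positions (24:21 and 13:10) come from the note above.

```python
def psx_mux0_ready_states(sample: int) -> tuple[int, int]:
    """Return (RdyState0, RdyState1) from one PSx collector Mux 0 sample.

    Hypothetical host-side helper. RdyState0 is returned first because
    it occurred in the CPU before RdyState1.
    """
    rdy_state1 = (sample >> 21) & 0xF  # collector bits 24:21
    rdy_state0 = (sample >> 10) & 0xF  # collector bits 13:10
    return rdy_state0, rdy_state1
```

Each cclk sample therefore yields two consecutive pclk ready states, so a list of samples can be flattened into a pclk-ordered sequence by emitting RdyState0 and then RdyState1 for each entry.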

11.11.2.3 PSx Input Collector Mux 1
Class 


CtbPsxMux1 


Attributes 


-ocla -ctb -ctbpsx 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
31:22 | LatAddrHi | xxx_ctb_SampleDataIn1_x0a[31:22] | cac.lat_xxx_Addr_c2a[35:26] | 10 upper Address bits [35:26]
21:10 | LatAddrLo | xxx_ctb_SampleDataIn1_x0a[21:10] | cac.lat_xxx_Addr_c2a[14:3] | 12 lower Address bits [14:3]
9:5 | LatCmdAddrTid | xxx_ctb_SampleDataIn1_x0a[9:5] | cac.lat_xxx_CmdAddrTID_c2a[4:0] | Command TID
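
Because this mux captures only two slices of the address, a collected sample gives a partial address rather than the full one. The sketch below, assuming samples are decoded on a host in Python (the function name is hypothetical), places the captured slices back at their original bit positions and leaves the uncaptured bits (25:15 and 2:0) as zero.

```python
def psx_mux1_decode(sample: int) -> tuple[int, int]:
    """Return (partial_address, tid) from one PSx collector Mux 1 sample.

    Hypothetical helper. Address bits that the mux does not capture
    are left as zero in the returned partial address.
    """
    addr_hi = (sample >> 22) & 0x3FF  # sample bits 31:22 = Addr[35:26]
    addr_lo = (sample >> 10) & 0xFFF  # sample bits 21:10 = Addr[14:3]
    tid = (sample >> 5) & 0x1F        # sample bits 9:5   = Command TID
    partial_address = (addr_hi << 26) | (addr_lo << 3)
    return partial_address, tid
```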


11.11.2.4 PSx Input Collector Mux 2 
Class 
CtbPsxMux2 


Attributes 


-ocla -ctb -ctbpsx 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
31:27 | PsxToCswCmd | xxx_ctb_SampleDataIn2_x0a[31:27] | cac.psx_csw_Command_c0a[4:0] | Processor segment to cache switch command
 | CtlToSlcInvPrb | xxx_ctb_SampleDataIn2_x0a | cac.ctl_slc_InvPrb_c6a |
 | CtlToSlcWinPrb | xxx_ctb_SampleDataIn2_x0a | cac.ctl_slc_WinPrb_c6a |
 | SlcToLamRdyState1bit3 | xxx_ctb_SampleDataIn2_x0a | cac.slc_lam_RdyState1_c2a[3] | Bit3 of RdyState1, a mistake, but can be used
23:21 | | xxx_ctb_SampleDataIn2_x0a[23:21] | [See Note 1] |
18:17 | TagToCtlBlkState | xxx_ctb_SampleDataIn2_x0a[18:17] | cac.tag_ctl_BlkState_cza[1:0] |
16 | TagToCtlW1Miss | xxx_ctb_SampleDataIn2_x0a[16] | cac.tag_ctl_W1Miss_cza | See Note 2: Ice9A - Tag Miss or Idle on Way-1 (same as ~Hit); Ice9B - Tag Hit on Way-1
15 | TagToCtlW0Miss | xxx_ctb_SampleDataIn2_x0a[15] | cac.tag_ctl_W0Miss_cza | See Note 2: Ice9A - Tag Miss or Idle on Way-0 (same as ~Hit); Ice9B - Tag Hit on Way-0
14:13 | | xxx_ctb_SampleDataIn2_x0a[14:13] | cac.slc_xxx_PrbDirty_cya[1:0] |
12:11 | | xxx_ctb_SampleDataIn2_x0a[12:11] | cac.slc_xxx_PrbDone_cya[1:0] |
10 | | xxx_ctb_SampleDataIn2_x0a[10] | cac.ctl_dat_WtPrb2L2_c5a |
9 | | xxx_ctb_SampleDataIn2_x0a[9] | cac.ctl_dat_PrbRdReq_c5a |
8 | | xxx_ctb_SampleDataIn2_x0a[8] | cac.slc_xxx_BiuPaused_c2b |
7 | | xxx_ctb_SampleDataIn2_x0a[7] | cac.slc_dat_PrbWbVal_cya |
6:3 | SlcToLamRdyState0 | xxx_ctb_SampleDataIn2_x0a[6:3] | cac.slc_lam_RdyState0_c2a[3:0] |
2 | SlcToTagBiuRead | xxx_ctb_SampleDataIn2_x0a[2] | cac.slc_tag_BiuRead_cya |
1 | SlcToTagBiuWrite | xxx_ctb_SampleDataIn2_x0a[1] | cac.slc_tag_BiuWrite_cya |
0 | SlcToTagBiuMemAcc | xxx_ctb_SampleDataIn2_x0a[0] | cac.slc_tag_BiuMemAcc_cya |


Note 1: 


case ({cac.ctl_tag_InvReq_c5a, cac.ctl_tag_WinReq_c5a, cac.ctl_tag_BrdReq_c5a, cac.ctl_tag_BwtReq_c5a,
cac.ctl_tag_ShrReq_c5a})
5'b00001 : xxx_ctb_SampleDataIn2_x0a[23:21] <= 3'd1; // ShrReq
5'b00010 : xxx_ctb_SampleDataIn2_x0a[23:21] <= 3'd2; // BwtReq
5'b00100 : xxx_ctb_SampleDataIn2_x0a[23:21] <= 3'd3; // BrdReq
5'b01000 : xxx_ctb_SampleDataIn2_x0a[23:21] <= 3'd4; // WinReq
5'b10000 : xxx_ctb_SampleDataIn2_x0a[23:21] <= 3'd5; // InvReq
default : xxx_ctb_SampleDataIn2_x0a[23:21] <= 3'd0; // none of the above, or more-than-one of the above
endcase
Note 2: 
In Ice9A bits 15 and 16 are cac.tag_ctl_W0Miss_cza and cac.tag_ctl_W1Miss_cza.
In Ice9B and later bits 15 and 16 are cac.tag_ctl_W0Hit_cza and cac.tag_ctl_W1Hit_cza.
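
To read the encoded request field back out of a collected sample, the mapping in Note 1 can be inverted on the host. The following is a sketch only: the function and table names are hypothetical, and the label used for code 0 is my own wording for the "none of the above, or more-than-one of the above" case.

```python
# Hypothetical decode table for collector bits 23:21, taken from the
# case statement in Note 1. Codes 6 and 7 are never produced by it.
CTL_TO_TAG_REQUEST = {
    0: "none-or-multiple",
    1: "ShrReq",
    2: "BwtReq",
    3: "BrdReq",
    4: "WinReq",
    5: "InvReq",
}


def psx_mux2_request(sample: int) -> str:
    """Name the CtlToTag request encoded in bits 23:21 of a Mux 2 sample."""
    code = (sample >> 21) & 0x7
    return CTL_TO_TAG_REQUEST.get(code, "undefined")
```

Note that code 0 is ambiguous by construction: the hardware cannot distinguish an idle cycle from one in which two requests were asserted together.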


11.11.2.5 PSx Input Collector Mux 3 
Class 
CtbPsxMux3 


Attributes 


-ocla -ctb -ctbpsx 
Bit | Mnemonic | (CTB Input) | (Signal) | Definition


11.11.2.6 PSx Input Collector Mux 4, 5, 6, 7 


The data mux leading into PSx CTBs has only the lower 2 bits of ExtMuxSel wired-up, selecting between the 
4 options described above. This means ExtMuxSel values 4,5,6,7 give you the same data choices as 0,1,2,3. 
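
The aliasing can be expressed as a mask of the low two select bits. This is a small sketch, assuming ExtMuxSel is a 3-bit value in the range 0 to 7; the function name is hypothetical.

```python
def psx_effective_mux(ext_mux_sel: int) -> int:
    """Return the PSx collector data choice actually selected.

    Only the lower 2 bits of ExtMuxSel are wired up for PSx CTBs,
    so values 4..7 alias onto 0..3.
    """
    if not 0 <= ext_mux_sel <= 7:
        raise ValueError("ExtMuxSel is assumed to be a 3-bit value")
    return ext_mux_sel & 0x3
```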


11.12 OCLA in use — COHx 


“COHx” means either of COHe or COHo. 


11.12.0.7 COHx Trigger and Collector Enabling 


Due to a minor bug affecting COHx only, both Trigger and Collector Blocks must be enabled to use either. By 
“enabled” I mean setting their external muxes to other than 7. COHe and COHo are separately enabled. They all 
default to 7, which disables OCLA activities, saving power. 

For example: If all I wanted to use was the COHo Collector Block (triggering was done elsewhere, not in COH), 
I would need to set COHo Collector Block External Mux to the setting for what I wanted to collect, and I would 
need to set COHo Codeword Trigger Block External Mux to any value other than 7. COHe external muxes could 
be left at their default values. 
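
A setup script could guard against this bug with a check like the one below. It is a sketch under the assumption that software tracks the two external mux settings per COH unit as small integers; the names are hypothetical and no register interface is implied.

```python
MUX_DISABLED = 7  # default external mux value, which disables the block


def cohx_ocla_usable(trigger_ext_mux: int, collector_ext_mux: int) -> bool:
    """True if a COHx unit's OCLA blocks can be used.

    Because of the COHx bug, both the Codeword Trigger Block and the
    Collector Block external muxes must be set to something other
    than 7, even when only one of the two blocks is needed.
    """
    return trigger_ext_mux != MUX_DISABLED and collector_ext_mux != MUX_DISABLED
```

The check applies to COHe and COHo independently, since they are separately enabled.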


11.12.1 COHx Triggers 


The following tables define the codeword triggers for both the Even and Odd coherence controllers. For the 
ICE9, the coherence units provide up to four mux selectable groupings of codeword triggers. Each class below 
represents one of the four mux selectable groupings. 


11.12.1.1 COHx Codeword Trigger Mux 0: Trigger on incoming command/source/data-op + tag- 
results + orc/wbc hit 
Class 
TrbcCohxMux0 


Attributes 


-ocla -trbe -trbccohx 


CohToDdrRdShootDwn | xxx_trbc_CodeSamp1Mux0_x0a | m_coh_ddr_RdShootDown_c5a || m_coh_ddr_RaWShootDown_c4a | tbs
 | xxx_trbc_CodeValid0_x0a | Hardwired to logic ’1’ |
W5 | Cv1InCmdAddrVal | xxx_trbc_CodeValid1_x0a |


Note that W0 signals are delayed by 1 cclk compared with InCmdAddrValid and other signals, and W1 signals
are delayed by 2 cclks compared with InCmdAddrValid and other signals.
11.12.1.2 COHx Codeword Trigger Mux 1: Trigger on ORC/WBC behavior + incoming command
Class 
TrbcCohxMux1 


Attributes 


-ocla -trbe -trbccohx 


Bit | Mnemonic | (Codeword Sample Input) | (Signal) | Definition
 | OrcToCtlDdrHit | | m_orc_ctl_DDRHit_c12a | DDR RAM hit flag in the ORC


Notes:


1. (orc_ctl_DDRHit_c12a ? orc_ctl_DDRDepTID_c12a : 0) | (orc_ctl_PrbHit_c4a ? orc_ctl_PrbDepTID_c4a : 0) | (orc_ctl_DepVal_c5a
? wbc_ctl_DepTID_c5a : 0) | (wbc_ctl_WrsHit_c7a ? wbc_ctl_WrsTID_c7a : 0) | (wbc_ctl_BwtCanHit_c4a ? wbc_ctl_BwtCanDepTID_


11.12.1.3 COHx Codeword Trigger Mux 2: Trigger on the DDR Interface 


TrbcCohxMux2 


Attributes 


-ocla -trbe -trbccohx 


W5[0] | m_cmd_xxx_InCmdAddrValid_c3a | tbs


Notes:


1. m_coh_ddr_RdShootDown_c5a || m_coh_ddr_RaWShootDown_c4a


2. ((m_ddr_coh_DataValid_c3a || m_ddr_coh_RdShotDown_c3a) ? m_ddr_coh_DataTID_c3a : 0x00) | (m_ddr_coh_WtTIDVal_c6a
? m_ddr_coh_WtTID_c6a : 0x00)


3. m_coh_ddr_WrValid_c6a ? m_coh_ddr_WrTID_c6a : 0x1f
4. m_coh_ddr_RdValid_c3a ? m_coh_ddr_RdTID_c3a : 0x1f
11.12.1.4 COHx Codeword Trigger Mux 3: Trigger on an Incoming Address
Class 
TrbcCohxMux3 


Attributes 


-ocla -trbe -trbccohx 


Bit | Mnemonic | (Codeword Sample Input) | (COH Signal) | Definition
W0[4:0] | InAddr | xxx_trbc_CodeSamp0Mux3[4:0] | m_cmd_xxx_InAddr_c3a[8:7], m_cmd_xxx_InAddr_c3a[5:3] | Incoming ? Address
W1[4:0] | InPageAddr | xxx_trbc_CodeSamp1Mux3[4:0] | m_cmd_xxx_InAddr_c3a[20:16] | Incoming Page Address
W2[4:0] | OutRdAddr1 | xxx_trbc_CodeSamp2Mux3[4:0] | m_coh_ddr_RdAddr_c3a[8:7], m_coh_ddr_RdAddr_c3a[5:3] | Outgoing ? Address
W3[4:0] | OutRdAddr2 | xxx_trbc_CodeSamp3Mux3[4:0] | m_coh_ddr_RdAddr_c3a[8:7], | (Same as mux selection 2)
W4[0] | Cv0Always1 | xxx_trbc_CodeValid0_x0a | | Hardwired to logic ’1’
W5[0] | Cv1InCmdAddrValid | xxx_trbc_CodeValid1_x0a | m_cmd_xxx_InCmdAddrValid_c3a |


11.12.2 COHx Collectors 


Each of the COHx units, Cohe (even) and Coho (odd), will have a collector to record commands, TIDs, and 
tag indices arriving at that COH. 

Note: If you are looking at the COH source code, you’ll see 4 OCLA collectors instantiated in Cohe and 4 in
Coho! These are 1 for the COHx unit, and 3 for PSx units. When using OCLA, you don’t have to pay attention
to where the collectors are actually instantiated; all you care about is what signals they’re hooked to. So, for
functional purposes, each COHx has only 1 OCLA Collector.


11.12.2.1 Cohx Input Collectors Qualifying Triggers 
Class 
CtbCohxQtrig 


Attributes 


-ocla -ctb -ctbcohx 


Bit | Mnemonic | (CTB Input) | (Signal)
1 | InCmdAddrValid | xxx_ctb_QualTrigger1_x0a | m_cmd_xxx_InCmdAddrValid_c3a
0 | | xxx_ctb_QualTrigger0_x0a | [See Note 1]


Notes:


1. m_coh_csw_OutDataTarget_c3a[0] || m_coh_csw_OutCmdAddrTarget_c1a[0]


11.12.2.2 Cohx Input Collector Mux 0 
Class 
CtbCohxMux0 


Attributes 


-ocla -ctb -ctbcohx 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
31:16 | InAddrHi | xxx_ctb_SampleDataIn0_x0a[31:16] | m_InAddr_c3a[31:16] | Page Address [31:16]
15:13 | InAddrLo | xxx_ctb_SampleDataIn0_x0a[15:13] | m_InAddr_c3a[5:3] | Page Address [5:3]
7:3 | InCmdAddrTid | xxx_ctb_SampleDataIn0_x0a[7:3] | m_InCmdAddrTID_c3a[4:0] | Incoming TID
2:0 | | xxx_ctb_SampleDataIn0_x0a[2:0] | ml_Owner_c4a[2:0] | Block Owner


11.12.2.3 Cohx Input Collector Mux 1
Class 
CtbCohxMux1 


Attributes 


-ocla -ctb -ctbcohx 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
31 | | xxx_ctb_SampleDataIn1_x0a[31] | | ORC Cache Address Hit
30 | WbcToCtlAddrHit | xxx_ctb_SampleDataIn1_x0a[30] | | Write Back Cache Address Hit
29 | OrcToCtlDdrHit | xxx_ctb_SampleDataIn1_x0a[29] | | DDR Hit
28 | OrcToCtlPrbHit | xxx_ctb_SampleDataIn1_x0a[28] | | Cache Probe Hit
27 | WbcToCtlDepVal | xxx_ctb_SampleDataIn1_x0a[27] | | tbs
26 | WbcToCtlDepShr | xxx_ctb_SampleDataIn1_x0a[26] | | tbs
25 | WbcToCtlWrsHit | xxx_ctb_SampleDataIn1_x0a[25] | | tbs
24 | WbcToCtlBwtCanHit | xxx_ctb_SampleDataIn1_x0a[24] | | tbs
18 | DdrToCohDvOrRdShtDwn | xxx_ctb_SampleDataIn1_x0a[18] | [See_Note_2] | tbs
17 | CohToDdrRdShootDwn | xxx_ctb_SampleDataIn1_x0a[17] | | tbs
16:13 | DdrToCohDataTid | xxx_ctb_SampleDataIn1_x0a[16:13] | m_ddr_coh_DataTID_c3a[4:1] | tbs
12:8 | InCmd | xxx_ctb_SampleDataIn1_x0a[12:8] | m_cmd_xxx_InCommand_c3a[4:0] | Incoming command
7:3 | InCmdAddrTid | xxx_ctb_SampleDataIn1_x0a[7:3] | m_cmd_xxx_InCmdAddrTID_c3a[4:0] | Incoming command TID
2:0 | Owner | xxx_ctb_SampleDataIn1_x0a[2:0] | ml_Owner_c4a[2:0] | tbs


Notes: 


1. ml_DepTID_ca =
(m_orc_ctl_DDRHit_c12a ? m_orc_ctl_DDRDepTID_c12a : 0x00)
| (m_orc_ctl_PrbHit_c4a ? m_orc_ctl_PrbDepTID_c4a : 0x00)
| (m_wbc_ctl_DepVal_c5a ? m_wbc_ctl_DepTID_c5a : 0x00)
| (m_wbc_ctl_WrsHit_c7a ? m_wbc_ctl_WrsTID_c7a : 0x00)
| (m_wbc_ctl_BwtCanHit_c4a ? m_wbc_ctl_BwtCanDepTID_c4a : 0x00);


2. (m_ddr_coh_DataValid_c3a || m_ddr_coh_RdShotDown_c3a)


3. (m_coh_ddr_RdShootDown_c5a || m_coh_ddr_RaWShootDown_c4a)


11.12.2.4 Cohx Input Collector Mux 2 
Class 
CtbCohxMux2 


Attributes 


-ocla -ctb -ctbcohx 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
2:0 | | xxx_ctb_SampleDataIn2_x0a[2:0] | ml_Owner_c4a[2:0] | tbs


Notes:


1. ((m_coh_csw_OutCmdAddrTarget_c1a != 0x000) ? m_coh_csw_OutCommand_c1a : E_CohCmd_IDLE) where E_CohCmd_IDLE
= 0x07


2. (m_coh_csw_OutDataTarget_c3a != 0x000) ? m_coh_csw_OutDataTID_c3a : 0x1f


11.12.2.5 Cohx Input Collector Mux 3 
Class 
CtbCohxMux3


Attributes 


-ocla -ctb -ctbcohx 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
31:0 | InAddrDW | xxx_ctb_SampleDataIn3_x0a[31:0] | m_cmd_xxx_InAddr_c3a[35:4] | Full physical address [35:4]
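
Since the sample holds address bits [35:4], recovering the address is a left shift by four, with the low four bits unknown. A minimal sketch, with a hypothetical function name:

```python
def cohx_mux3_address(sample: int) -> int:
    """Return the physical address captured by COHx collector Mux 3.

    The 32-bit sample carries address bits [35:4]; bits [3:0] are not
    captured and are returned as zero.
    """
    return (sample & 0xFFFFFFFF) << 4
```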


11.12.2.6 Cohx Input Collector Mux 4 
Class 
CtbCohxMux4 


Attributes 


-ocla -ctb -ctbcohx 


Bit | Mnemonic | (CTB Input) | (Signal) | Definition
31:0 | CycCtr | xxx_ctb_SampleDataIn4_x0a[31:0] | m_FreeRunCtr_x0a[31:0] | A free running 32 bit counter that increments every CCLK cycle. Counting is not affected by mux selections or enabling of OCLA. Not settable. Cleared during reset. Rolls over.


This allows you to time-stamp collections within the rollover time of 2**32 cclks. This can be used in parallel 
with any other collector block. Since this uses either the COHe or COHo collector block, you cannot collect signals 
in both COHe and COHo and get these timestamps all at once. 

In the COHe or COHo used, make sure to set both Collector and Trigger external muxes to non-7 values, even 
if no COH triggers are needed, otherwise this collector remains disabled. 

Since 2**32 cclks has probably elapsed many times since reset was released, the usefulness of this is limited to
relative times between two or more periods of collection driven from a LAC program. If your LAC program collects,
then stops collecting, then starts collecting, then stops collecting, the values stored from this counter can tell you
how long that middle period of not collecting was. This can show you the time between two events, if you are
confident that less than 2**32 cclks has passed. One way to be sure only a short time passed is to program
LAC with one of its counters as a time-out on the middle non-collecting period. Another way to be sure less
than 2**32 cclks have passed is for whatever processor code starts the LAC program and then checks for “done”
flags to read its own CPU internal cycle counter while polling for “done”, or just to have a software timeout on
polling for “done”.
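
The interval between two collected counter values can be computed with modulo-2**32 subtraction, which stays correct across a single rollover. This is a sketch with hypothetical names; it relies on the same condition stated above, that fewer than 2**32 cclks separate the two samples.

```python
COUNTER_MODULUS = 1 << 32  # the free running counter is 32 bits wide


def elapsed_cclks(first: int, second: int) -> int:
    """Return cclks elapsed between two CycCtr samples.

    Correct only if fewer than 2**32 cclks passed between the samples,
    that is, the counter rolled over at most once.
    """
    return (second - first) % COUNTER_MODULUS
```

For example, a first sample of 0xFFFFFFF0 followed by a second sample of 0x10 gives an interval of 0x20 cclks, even though the raw second value is smaller than the first.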
11.12.2.7 Cohx Input Collector Mux 5 or 6


Collect all zeros. 


11.12.2.8 Cohx Input Collector Mux 7 
Disable CTB. 


11.13 OCLA in use — FSW 
11.13.1 FSW Triggers 


We'd like to be able to trigger on different events occurring at the FSW input and output ports. However, the
FSW has three in/out ports from the DMA engine and three more in/out ports to the fabric link logic. That’s way
too much to record and hook onto. So we instrument DMA to FSW port-0, FSW to DMA port-0,
FLR to FSW port-0, and FSW to FLT port-0.

There are three trigger units. These trigger units give us the ability to detect start of packet/end of packet
events, transitions to and from mission mode, poisoned packets and interesting routes. Trigger inputs from the
control signals are routed to a Codeword TRB (TRBC) as shown in section 11.13.1.1. Four groups of 32 bits from 
the input data paths are routed to one Vector TRB, while four groups of 32 bits from the output data paths are 
routed to a second Vector TRB. Control paths to and from the links are also routed to these Vector TRBs. The 
Vector TRB connections are described in sections 11.13.1.2 to 11.13.1.11. 


11.13.1.1 FSW Codeword Trigger Block Inputs 

The Fabric Switch codeword trigger blocks define sets of events that can be enabled separately or grouped 
together to provide interesting triggers for events within the Fabric Switch (FSW). 
Class 

TrbcFsw 


Attributes 


-ocla -trbec -trbcfsw 


Bit | Mnemonic | (Codeword Sample Input) | (Signal) | Definition
W0[4] | FlrToFswDatVal | xxx_trbc_CodeSamp0[4] | flr0_fsw_DatVal_s0a | Data Packets Data-Valid from FLR-0
W0[3] | | xxx_trbc_CodeSamp0[3] | flr0_fsw_SoP_s0a | Start of Data Packet from FLR-0
W0[2] | FlrToFswSopD1 | xxx_trbc_CodeSamp0[2] | ocla_flr_fsw_sop_d1 | Start of Data Packet from FLR-0, delayed 1 sclk
 | | xxx_trbc_CodeSamp0 | flr0_fsw_EoP_s0a | End of Data Packet from FLR-0
 | FswToFlrNewCtlPktD1 | xxx_trbc_CodeSamp0 | ocla_fsw_flr_newctlpkt_d1 | Start of Control Packet to FLR-0, delayed 1 sclk
 | FltToFswDatVal | xxx_trbc_CodeSamp1 | flt0_fsw_DatVal_s0a | Control Packets Data-Valid from FLT-0
 | | xxx_trbc_CodeSamp1 | fsw_flt0_SoP_s2a | Start of Data Packet to FLT-0
 | | xxx_trbc_CodeSamp1 | ocla_fsw_flt_sop_d1 | Start of Data Packet to FLT-0, delayed 1 sclk
 | FswToFltEoP | xxx_trbc_CodeSamp1 | fsw_flt0_EoP_s2a | End of Data Packet to FLT-0
 | FltToFswNewCtlPktD1 | xxx_trbc_CodeSamp1 | ocla_flt_fsw_newctlpkt_d1 | Start of Control Packet from FLT-0, delayed 1 sclk
 | | xxx_trbc_CodeSamp2 | dma_fsw_SoP0_s0a | Start of Packet from DMA port TX0
 | | xxx_trbc_CodeSamp2 | ocla_dma_fsw_sop_d1 | Start of Packet from DMA port TX0, delayed 1 sclk
 | | xxx_trbc_CodeSamp2 | ocla_dma_fsw_sop_d2 | Start of Packet from DMA port TX0, delayed 2 sclks
 | | xxx_trbc_CodeSamp2 | dma_fsw_EoP0_s0a | End of Packet from DMA port TX0
W2[0] | | xxx_trbc_CodeSamp2[0] | fsw_dma_BufAvail0_s3a | FSW Buffer Available signal to DMA port TX0
W3[4] | | xxx_trbc_CodeSamp3[4] | fsw_dma_SoP0_s2a | Start of Packet to DMA port RX0
W3[3] | | xxx_trbc_CodeSamp3[3] | ocla_fsw_dma_sop_d1 | Start of Packet to DMA port RX0, delayed 1 sclk
W3[2] | | xxx_trbc_CodeSamp3[2] | ocla_fsw_dma_sop_d2 | Start of Packet to DMA port RX0, delayed 2 sclks
W3[1] | | xxx_trbc_CodeSamp3[1] | fsw_dma_EoP0_s2a | End of Packet to DMA port RX0
W3[0] | DmaToFswRdy | xxx_trbc_CodeSamp3[0] | dma_fsw_Rdy0_s1a | DMA ready for new packet from FSW on port RX0
W4[0] | Cv0FlrToFswMsnMode | xxx_trbc_CodeValid0 | flr0_fsw_MissionMode | MissionMode from FLR-0
W5[0] | Cv1FltToFswMsnMode | xxx_trbc_CodeValid1 | flt0_fsw_MissionMode | MissionMode from FLT-0


11.13.1.2 FSW Input Vector Trigger (Mux 0) 


These are the fields selected from data coming into the FSW when MuxSel=0. 


Class 
TrbvFswiMux0 


Attributes 


-ocla -trbv -trbvfswi 
Bit | Mnemonic | (Signal) | Definition
W0[31:0] | FlrToFswInDat | flr0_fsw_InDat_s0a[63:60], flr0_fsw_InDat_s0a[35:8] | Fields selected for data coming into the FSW.
W1[0] | FlrToFswIdle | flr0_fsw_Idle_s0a | Data from link is IDLE packet or DATA packet


11.13.1.3. FSW Input Vector Trigger (Mux 1) 


These are the fields selected from data coming into the FSW when MuxSel=1. 


Class 
TrbvFswiMux1


Attributes 


-ocla -trbv -trbvfswi 
Bit | Mnemonic | (Signal) | Definition
W0[31:0] | FlrToFswInDat | flr0_fsw_InDat_s0a[59:36], flr0_fsw_InDat_s0a[7:0] | Fields selected for data coming into the FSW.
W1[0] | FlrToFswIdle | flr0_fsw_Idle_s0a | Data from link is IDLE packet or DATA packet


11.13.1.4 FSW Input Vector Trigger Mux 2 


These are the fields selected from data coming into the FSW when MuxSel=2. 


Class 


TrbvFswiMux2 
Attributes


-ocla -trbv -trbvfswi 


Bit | Mnemonic | (Signal) | Definition
W0[31:0] | FswToFlrCtlDat | flr0_fsw_InDat_s0a[59:36], fsw_flr0_CtlDat_s3a[7:0] | Fields selected for data coming into the FSW.
W1[0] | FlrToFswIdle | flr0_fsw_Idle_s0a | Data from link is IDLE packet or DATA packet


Although fsw_flr0_CtlDat_s3a[7:0] is an output of FSW, it’s considered part of the “FLR0 input interface” to
FSW, so we provide it as an option in the FSW Input trigger block.


11.13.1.5 FSW Input Vector Trigger Mux 3 
These are the fields selected from data coming into the FSW when MuxSel=3. 


Class 
TrbvFswiMux3 


Attributes 


-ocla -trbv -trbvfswi 


Bit | Mnemonic | (Signal) | Definition
W0[31:0] | DmaToFswInDat | dma_fsw_InDat0_s0a[63:60], dma_fsw_InDat0_s0a[35:8] | Fields selected for data coming into the FSW.
W1[0] | DmaToFswDatVal | dma_fsw_DatVal0_s0a | Data from DMA engine is worth looking at


11.13.1.6 FSW Input Vector Trigger Mux 4 
These are the fields selected from data coming into the FSW when MuxSel=4. 


Class 
TrbvFswiMux4 


Attributes 


-ocla -trbv -trbvfswi 


Bit | Mnemonic | (Signal) | Definition
W0[31:0] | DmaToFswInDat | dma_fsw_InDat0_s0a[59:36], dma_fsw_InDat0_s0a[7:0] | Fields selected for data coming into the FSW.
W1[0] | DmaToFswDatVal | dma_fsw_DatVal0_s0a | Data from DMA engine is worth looking at


11.13.1.7 FSW Output Vector Trigger Mux 0 
These are the fields selected from data being driven from the FSW when MuxSel=0. 


Class 
TrbvFswoMux0 


Attributes 


-ocla -trbv -trbvfswo 
Bit | Mnemonic | (Signal) | Definition
W0[31:0] | FswToFltOutDat | fsw_flt0_OutDat_s2a[63:60], fsw_flt0_OutDat_s2a[35:8] |
W1[0] | FswToFltIdle | fsw_flt0_Idle_s2a | Data from link is IDLE packet or DATA packet


11.13.1.8 FSW Output Vector Trigger Mux 1 


These are the fields selected from data being driven from the FSW when MuxSel=1. 


Class 
TrbvFswoMux1


Attributes 


-ocla -trbv -trbvfswo 
Bit | Mnemonic | (Signal) | Definition
W0[31:0] | FswToFltOutDat | fsw_flt0_OutDat_s2a[59:36], fsw_flt0_OutDat_s2a[7:0] |
W1[0] | FswToFltIdle | fsw_flt0_Idle_s2a | Data from link is IDLE packet or DATA packet


11.13.1.9 FSW Output Vector Trigger Mux 2 


These are the fields selected from data being driven from the FSW when MuxSel=2. 


Class 
TrbvFswoMux2 


Attributes 


-ocla -trbv -trbvfswo 


Bit | Mnemonic | (Signal) | Definition
W0[31:0] | FltToFswCtlDat | fsw_flt0_OutDat_s2a[59:36], flt0_fsw_CtlDat_s0a[7:0] |
W1[0] | FswToFltIdle | fsw_flt0_Idle_s2a | Data from link is IDLE packet or DATA packet


Although flt0_fsw_CtlDat_s0a[7:0] is an input of FSW, it’s considered part of the “FLT0 output interface”
to FSW, so we provide it as an option in the FSW Output trigger block.


11.13.1.10 FSW Output Vector Trigger Mux 3 
These are the fields selected from data being driven from the FSW when MuxSel=3. 


Class 
TrbvFswoMux3 


Attributes 


-ocla -trbv -trbvfswo 


Bit | Mnemonic | (Signal) | Definition
W0[31:0] | FswToDmaOutDat | fsw_dma_OutDat0_s2a[63:60], fsw_dma_OutDat0_s2a[35:8] | tbs
W1[0] | FswToDmaDatVal | fsw_dma_DatVal0_s2a | Data from DMA engine is worth looking at


11.13.1.11 FSW Output Vector Trigger Mux 4 


These are the fields selected from data being driven from the FSW when MuxSel=4. 


Class 
TrbvFswoMux4 


Attributes 


-ocla -trbv -trbvfswo 


Bit | Mnemonic | (Signal) | Definition
W0[31:0] | FswToDmaOutDat | fsw_dma_OutDat0_s2a[59:36], fsw_dma_OutDat0_s2a[7:0] | tbs
W1[0] | FswToDmaDatVal | fsw_dma_DatVal0_s2a | Data from DMA engine is worth looking at


11.13.2 FSW Collectors 


The FSW contains two CTBs, one for incoming data and one for outgoing data. The CTB for incoming data 
is connected to the same signals as the FSW Input Vector Trigger Block. The CTB for outgoing data is connected 
to the same signals as the FSW Output Vector Trigger Block. 
11.13.2.1 FSW Input Collectors Qualifying Triggers 
Class 


CtbFswiQtrig 


Attributes 


-ocla -ctb -ctbfswi 


Bit | Mnemonic | (Signal) | Definition
1 | DmaToFswDatVal | fsw.dma_fsw_DatVal0_s0a | Qualify collection on Dma to Fsw data valid.
0 | FlrToFswIdle | fsw.flr0_fsw_Idle_s0a | Qualify collection on Flr0 to Fsw Idle.


11.13.2.2 FSW Input Collector Mux 0 
Class 
CtbFswiMux0 


Attributes 


-ocla -ctb -ctbfswi 
Bit | Mnemonic | (Signal) | Definition
31:28 | FlrToFswDat6360 | fsw.flr0_fsw_InDat_s0a[63:60] | Data from FLR0 to FSW bits 63-60.
27:0 | FlrToFswDat358 | fsw.flr0_fsw_InDat_s0a[35:8] | Data from FLR0 to FSW bits 35-8.
11.13.2.3 FSW Input Collector Mux 1
Class 
CtbFswiMux1 


Attributes 


-ocla -ctb -ctbfswi 


Bit | Mnemonic | (Signal) | Definition
31:8 | FlrToFswDat5936 | fsw.flr0_fsw_InDat_s0a[59:36] | Data from FLR0 to FSW bits 59-36.
7:0 | FlrToFswDat70 | fsw.flr0_fsw_InDat_s0a[7:0] | Data from FLR0 to FSW bits 7-0.
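
Mux 0 and Mux 1 between them cover all 64 bits of the incoming data word, but a collector records only one mux selection at a time, so the two halves come from separate collection runs. The sketch below places each sample's fields back at their 64-bit positions. The function names are hypothetical, and the placement of each field within the 32-bit sample is an assumption: upper field in the high bits, matching the order in which the signals are listed.

```python
def fsw_in_mux0_to_word(sample: int) -> int:
    """Place a Mux 0 sample's fields at their InDat[63:0] positions."""
    top = (sample >> 28) & 0xF        # assumed sample bits 31:28 = InDat[63:60]
    middle = sample & 0x0FFFFFFF      # assumed sample bits 27:0  = InDat[35:8]
    return (top << 60) | (middle << 8)


def fsw_in_mux1_to_word(sample: int) -> int:
    """Place a Mux 1 sample's fields at their InDat[63:0] positions."""
    upper = (sample >> 8) & 0xFFFFFF  # assumed sample bits 31:8 = InDat[59:36]
    low = sample & 0xFF               # assumed sample bits 7:0  = InDat[7:0]
    return (upper << 36) | low
```

The two results never overlap, so if the same packet is captured once with each mux selection, OR-ing them together gives the full word. The same layout applies to the DMA input selections (Mux 3 and Mux 4), which select the same bit ranges of dma_fsw_InDat0_s0a.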


11.13.2.4 FSW Input Collector Mux 2 
Class 
CtbFswiMux2 


Attributes 


-ocla -ctb -ctbfswi 


Bit | Mnemonic | (Signal) | Definition
31:8 | FlrToFswDat5936 | fsw.flr0_fsw_InDat_s0a[59:36] | Data from FLR0 to FSW bits 59-36.
7:0 | FswToFlrCtlDat | fsw.fsw_flr0_CtlDat_s3a[7:0] | Control Data from FSW to FLR0.


Although fsw_flr0_CtlDat_s3a[7:0] is an output of FSW, it’s considered part of the “FLR0 input interface”
to FSW, so we provide it as an option in the FSW Input collector block.
11.13.2.5 FSW Input Collector Mux 3 
Class 

CtbFswiMux3 


Attributes 


-ocla -ctb -ctbfswi 


Bit | Mnemonic | (Signal) | Definition
31:28 | DmaToFswDat6360 | fsw.dma_fsw_InDat0_s0a[63:60] | Data from DMA to FSW bits 63-60.
27:0 | DmaToFswDat358 | fsw.dma_fsw_InDat0_s0a[35:8] | Data from DMA to FSW bits 35-8.


11.13.2.6 FSW Input Collector Mux 4 
Class 
CtbFswiMux4 


Attributes 


-ocla -ctb -ctbfswi 


Bit | Mnemonic | (Signal) | Definition
31:8 | DmaToFswDat5936 | fsw.dma_fsw_InDat0_s0a[59:36] | Data from DMA to FSW bits 59-36.
7:0 | DmaToFswDat70 | fsw.dma_fsw_InDat0_s0a[7:0] | Data from DMA to FSW bits 7-0.


11.13.2.7 FSW Input Collector Mux 5, 6, 7 


Gives you the same as Mux 4. 


11.13.2.8 FSW Output Collectors Qualifying Triggers 
Class 
CtbFswoQtrig 


Attributes 


-ocla -ctb -ctbfswo 


Bit | Mnemonic | (Signal) | Definition
1 | FswToDmaDatVal | fsw.fsw_dma_DatVal0_s2a | Qualify collection on Fsw to Dma data valid.
0 | FswToFltIdle | fsw.fsw_flt0_Idle_s2a | Qualify collection on Fsw to Flt0 Idle.


11.13.2.9 FSW Output Collector Mux 0 


Class 
CtbFswoMux0 


Attributes 


-ocla -ctb -ctbfswo 
Bit | Mnemonic | (Signal) | Definition
31:28 | FswToFltDat6360 | fsw.fsw_flt0_OutDat_s2a[63:60] | Data from FSW to FLT0 bits 63-60.
27:0 | FswToFltDat358 | fsw.fsw_flt0_OutDat_s2a[35:8] | Data from FSW to FLT0 bits 35-8.


11.13.2.10 FSW Output Collector Mux 1 


Class 
CtbFswoMux1


Attributes 


-ocla -ctb -ctbfswo 


Bit | Mnemonic | (Signal) | Definition
31:8 | FswToFltDat5936 | fsw.fsw_flt0_OutDat_s2a[59:36] | Data from FSW to FLT0 bits 59-36.
7:0 | FswToFltDat70 | fsw.fsw_flt0_OutDat_s2a[7:0] | Data from FSW to FLT0 bits 7-0.


11.13.2.11 FSW Output Collector Mux 2 
Class 
CtbFswoMux2 
Attributes


-ocla -ctb -ctbfswo 


Bit | Mnemonic | (Signal) | Definition
31:8 | FswToFltDat5936 | fsw.fsw_flt0_OutDat_s2a[59:36] | Data from FSW to FLT0 bits 59-36.
7:0 | FltToFswCtlDat | fsw.flt0_fsw_CtlDat_s0a[7:0] | Control Data from FLT0 to FSW.


Although flt0_fsw_CtlDat_s0a[7:0] is an input of FSW, it’s considered part of the “FLT0 output interface”
to FSW, so we provide it as an option in the FSW Output collector block.


11.13.2.12 FSW Output Collector Mux 3 
Class 
CtbFswoMux3 


Attributes 


-ocla -ctb -ctbfswo 
Bit | Mnemonic | (Signal) | Definition
31:28 | FswToDmaDat6360 | fsw.fsw_dma_OutDat0_s2a[63:60] | Data from FSW to DMA bits 63-60.
27:0 | FswToDmaDat358 | fsw.fsw_dma_OutDat0_s2a[35:8] | Data from FSW to DMA bits 35-8.


11.13.2.13 FSW Output Collector Mux 4 
Class 
CtbFswoMux4 


Attributes 


-ocla -ctb -ctbfswo 


Bit | Mnemonic | (Signal) | Definition
31:8 | FswToDmaDat5936 | fsw.fsw_dma_OutDat0_s2a[59:36] | Data from FSW to DMA bits 59-36.
7:0 | FswToDmaDat70 | fsw.fsw_dma_OutDat0_s2a[7:0] | Data from FSW to DMA bits 7-0.


11.13.2.14 FSW Output Collector Mux 5, 6, 7 


Gives you the same as Mux 4. 


11.14 OCLA in use —- DMA 


11.14.1 DMA Triggers 


The DMA engine has a CSW Bus Stop trigger and collector unit, one vector trigger unit and one capture block. 
The inputs to the TRBV and the CTB are muxed from a set of 128 signals. The CSW side of the DMA engine is 
connected to a TRBC unit with connections shown in Section 11.14.1.1. 


11.14.1.1 DMA Codeword Triggers 


DMA Engine to Central Switch codeword triggers. 
-ocla -trbe -trbcdma


(Signal) | Definition
m_csw_dma_Command_c2a[4:0] | The incoming command code
m_csw_dma_CmdAddrTID_c2a[4:0] | The incoming command TID
m_csw_dma_DataTID_c4a[4:0] | The incoming data TID
 | Reserved (drive to ’0’)
 | Reserved (drive to ’0’)
dma_csw_ECmdAddrReq_c1a | Even bound command request
dma_csw_OCmdAddrReq_c1a | Odd bound command request
csw_dma_CmdAddrGnt_c2a | Command grant
m_csw_dma_CmdAddrValid_c2a | Command/transfer is valid
m_csw_dma_DataValid_c4a | Data is valid


The input to the TRBV is selected as shown in Sections 11.14.1.2, 11.14.1.3, 11.14.1.4, and 11.14.1.5. The
TRBV trb_xxx_MuxSel_xa[1:0] outputs select from among the four groups. The TRBV has one CodeValid input,
connected to m_ue_xxx_DbgValid_c2a.


11.14.1.2 DMA Vector Trigger Inputs (Mux 0) 


DMA Engine transmit and receive port buffer status. 


Class 
TrbvDmaMux0 


Attributes 


-ocla -trbv -trbvdma 


Mnemonic Signal) 


p0_ue_BufAvail.cla 


pl_ue_BufAvail.cla 


Txp0ToUEngBufA vai txp0_ue_BufAvailcla 


ue_rxp1_BufTransfer_c5a 


UEngToRxp1BufXfr 


WO[31 
W0[30 
WoO[29 
WO[28 
WO[27 
W0O[26 Txp1ToUEngBufA vai txpl_ue_BufAvailcla 
WO[25 
WoO[24 
WO[23 
WO[22 
Wo[21 
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Receive port 0 to microengine buffer avail- 
able 

Receive port 1 to microengine buffer avail- 
able 

p2_ue_BufAvail_cla Receive port 2 to microengine buffer avail- 
able 


UEngRxThreadStart copy_ue_RxThreadStart_cla | Microengine receive thread start 


Transmit port 0 to microengine buffer avail- 
able 

Transmit port 1 to microengine buffer avail- 
abl 


ble 
Txp2ToUEngBufA vai txp2_ue_BufAvailcla Transmit port 2 to microengine buffer avail- 
able 


UEngTxThreadStart copy-_ue_TxThreadStart_cla | Microengine transmit thread start 


UEngToRxp0BufXfr ue_rxp0_BufTransfer_c5a Microengine to receive port 0 buffer trans- 
fer 


Microengine to receive port 1 buffer trans- 
fer 


UEngToRxp2BugXfr ue_rxp2_BufTransfer_c5a Microengine to receive port 2 buffer trans- 
fer 
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Niner Signa 
UEngRxThreadDone ue_copy_RxThreadDone_c5a | Microengine receive thread done 


UEngToTxp0BufXfr ue_txp0_BufTransfer_c5a Microengine to transmit port O buffer 
UEngToTxp1BufXfr ue_txp1_BufTransfer_c5a Microengine to transmit port 1 buffer 


Microengine to transmit port 2 buffer 


ransfer 


EngDbgPc 


EngDbgValid ue_xxx_Dbg Valid_c2a Microengine Debug Valid Flag [See Note 1] 


Note 1: Connecting W1[0] to ue_xxx_DbgValid_c2a was a mistake: it asserts 2 cycles before the other Dbg signals. It should have been ue_xxx_DbgValid_c4a. Because ue_xxx_DbgValid_c4a is available as one of the triggers, you can still qualify on DbgValid by including W0[14] (ue_xxx_DbgValid_c4a) == 1 as part of the equation for a match.
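
The workaround in Note 1 amounts to adding one more bit to the match condition. The sketch below models that idea in Python; it is not dash or Anthrax code. The helper names (`require_bit`, `word_matches`) and the mask/match comparison rule are assumptions made for illustration. Only the bit position W0[14] comes from the note, and W0[23] is used purely as a sample starting condition.

```python
WORD_MASK = 0xFFFFFFFF  # trigger words are modelled as 32-bit values


def require_bit(mask, match, bit, value):
    """Return (mask, match) with `bit` additionally constrained to `value`."""
    mask |= 1 << bit
    if value:
        match |= 1 << bit
    else:
        match &= ~(1 << bit)
    return mask & WORD_MASK, match & WORD_MASK


def word_matches(word, mask, match):
    """Assumed rule: every bit selected by `mask` must equal the bit in `match`."""
    return (word & mask) == (match & mask)


# Hypothetical starting condition: trigger on W0[23] (UEngToRxp0BufXfr).
mask, match = require_bit(0, 0, 23, 1)
# Qualify by DbgValid as Note 1 suggests: also require W0[14] == 1.
mask, match = require_bit(mask, match, 14, 1)
```

How the mask and match values are actually written into the trigger block registers is not covered here and depends on the TRB register layout.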


11.14.1.3 DMA Vector Trigger Inputs (Mux 1)


DMA Engine transmit and receive port reference counts. 


Class 
TrbvDmaMux1 


Attributes 


-ocla -trbv -trbvdma 


Mnemonic | Signal | Description
CifToTxp2RefCntZero | cif_txp2_RefCntZero_c5a | Transmit port 2 reference count is zero
CifTxRefCntZero | cif_copy_TxRefCntZero_c5a | Copy transmit reference count is zero
CifToUEngStartIo | cif_ue_StartIo_c1a | Microengine IO Start
UEngToCifTaskThread | ue_cif_TaskThread_c5a[3:0] |
UEngDbgValid | ue_xxx_DbgValid_c2a |


11.14.1.4 DMA Vector Trigger Inputs (Mux 2) 
DMA Engine’s central switch to transmit/receive port interfaces. 
Class


TrbvDmaMux2 


Attributes 


-ocla -trbv -trbvdma 


Mnemonic | Signal | Description
MemInPbufSel | MemInPbufSel_c4a |
MemInRmbSel | MemInRmbSel_c4a |
MemInTxp0Sel | MemInTxp0Sel_c4a |
MemInTxp1Sel | MemInTxp1Sel_c4a |
MemInTxp2Sel | MemInTxp2Sel_c4a |
MemOutPbufSel | MemOutPbufSel_c2a |
MemOutWmbSel | MemOutWmbSel_c2a |
MemOutRxp0Sel | MemOutRxp0Sel_c2a |
MemOutRxp1Sel | MemOutRxp1Sel_c2a |
MemOutRxp2Sel | MemOutRxp2Sel_c2a |
MemOutCopySel | cif_rxp_MemOutCopySel_c2a |
MemInAlign | cif_xxx_MemInAlign_c4a[3:0] |
MemInAddr | cif_xxx_MemInAddr_c4a[7:0] |
UEngDbgValid | ue_xxx_DbgValid_c2a | Microengine Debug Valid Flag


11.14.1.5 DMA Vector Trigger Inputs (Mux 3)


DMA Engine internal memory writes.


Class
TrbvDmaMux3


Attributes
-ocla -trbv -trbvdma


Bits | Mnemonic | Signal | Description
W0[31] | DmemResultSel | ue_dmem_ResultSel_c5a | Asserted when dmem is written by an instruction
W0[30:21] | DmemResultAddr | ue_xxx_ResultAddr_c5a | Address in dmem where ALU result is written
W0[20:0] | DmemResultData | alu_xxx_ResultDat_c5a[20:0] | ALU result to be written to dmem
W1[0] | UEngDbgValid | ue_xxx_DbgValid_c2a | Microengine Debug Valid Flag


11.14.2 DMA Collector 


The DMA engine has a single 1024 x 33 bit CTB. Its inputs are configured identically to those for the vector 
TRB in the DMA engine. (See Tables 11.14.1.2, 11.14.1.3, 11.14.1.4 and 11.14.1.5.) 


11.14.2.1 DMA Input Collectors Qualifying Triggers 


The CTB has two qualifier inputs. Qtrig[1] is connected to ue_xxx_DbgValid_c2a, and Qtrig[0] is connected to
ue_cif_TaskStart_c5a.


Class
CtbDmaQtrig


Attributes
-ocla -ctb -ctbdma


Qtrig | Mnemonic | Signal | Description
1 | UEngDbgValid | dma.csr.ue_xxx_DbgValid_c2a | Microengine Debug Valid Flag [Broken, see Note 1]
0 | UEngToCifTaskStart | dma.csr.ue_cif_TaskStart_c5a | Microengine To CIF Task Start


Note 1: This is broken. It should have been connected to ue_xxx_DbgValid_c4a in order to allow us to collapse collection of “Dbg” signals. With it connected to ue_xxx_DbgValid_c2a we effectively cannot use this collection qualifier at all.


11.14.2.2 DMA Input Collector Mux 0 
Class 
CtbDmaMux0 


Attributes 


-ocla -ctb -ctbdma 


Bit | Mnemonic | Signal | Description
31 | Rxp0ToUEngBufAvail | dma.rxp0_ue_BufAvail_c1a | Receive port 0 to microengine buffer available
30 | Rxp1ToUEngBufAvail | dma.rxp1_ue_BufAvail_c1a | Receive port 1 to microengine buffer available
29 | Rxp2ToUEngBufAvail | dma.rxp2_ue_BufAvail_c1a | Receive port 2 to microengine buffer available
28 | UEngRxThreadStart | dma.copy_ue_RxThreadStart_c1a | Microengine receive thread start
27 | Txp0ToUEngBufAvail | dma.txp0_ue_BufAvail_c1a | Transmit port 0 to microengine buffer available
26 | Txp1ToUEngBufAvail | dma.txp1_ue_BufAvail_c1a | Transmit port 1 to microengine buffer available
25 | Txp2ToUEngBufAvail | dma.txp2_ue_BufAvail_c1a | Transmit port 2 to microengine buffer available
24 | UEngTxThreadStart | dma.copy_ue_TxThreadStart_c1a | Microengine transmit thread start
23 | UEngToRxp0BufXfr | dma.ue_rxp0_BufTransfer_c5a | Microengine to receive port 0 buffer transfer
22 | UEngToRxp1BufXfr | dma.ue_rxp1_BufTransfer_c5a | Microengine to receive port 1 buffer transfer
21 | UEngToRxp2BufXfr | dma.ue_rxp2_BufTransfer_c5a | Microengine to receive port 2 buffer transfer
20 | UEngRxThreadDone | dma.ue_copy_RxThreadDone_c5a | Microengine receive thread done
19 | UEngToTxp0BufXfr | dma.ue_txp0_BufTransfer_c5a | Microengine to transmit port 0 buffer transfer
18 | UEngToTxp1BufXfr | dma.ue_txp1_BufTransfer_c5a | Microengine to transmit port 1 buffer transfer
17 | UEngToTxp2BufXfr | dma.ue_txp2_BufTransfer_c5a | Microengine to transmit port 2 buffer transfer
 | | dma.csr.m_ue_xxx_DbgThread_c4a[3:0] | Microengine thread number
 | UEngDbgPc | dma.ue_xxx_DbgPc_c4a[9:0] | Microengine PC
11.14.2.3 DMA Input Collector Mux 1


Class
CtbDmaMux1


Attributes
-ocla -ctb -ctbdma


Bits | Mnemonic | Signal | Description
24 | CifToRxp0RefCntZero | cif_rxp0_RefCntZero_c5a | Receive port 0 reference count is zero
23 | CifToRxp1RefCntZero | cif_rxp1_RefCntZero_c5a | Receive port 1 reference count is zero
22 | CifToRxp2RefCntZero | cif_rxp2_RefCntZero_c5a | Receive port 2 reference count is zero
21 | CifRxRefCntZero | cif_copy_RxRefCntZero_c5a | Copy receive reference count is zero
20 | CifToTxp0RefCntZero | cif_txp0_RefCntZero_c5a | Transmit port 0 reference count is zero
19 | CifToTxp1RefCntZero | cif_txp1_RefCntZero_c5a | Transmit port 1 reference count is zero
18 | CifToTxp2RefCntZero | cif_txp2_RefCntZero_c5a | Transmit port 2 reference count is zero
17 | CifTxRefCntZero | cif_copy_TxRefCntZero_c5a | Copy transmit reference count is zero
16 | CifToUEngStartIo | cif_ue_StartIo_c1a | Microengine IO Start
15:14 | CifToUEngStartIoType | | Microengine IO Start Type
13:10 | CifToUEngStartIoAddr | | Microengine IO Start Address
 | UEngToCifRdyForStartIo | | Microengine ready for Start IO
 | UEngToCifTaskStart | ue_cif_TaskStart_c5a | Microengine task start
 | UEngToCifTaskThread | ue_cif_TaskThread_c5a[3:0] | Microengine task thread
 | | w_dma_CmdOrigin_c1a | Origin of CSW command


11.14.2.4 DMA Input Collector Mux 2 


Class
CtbDmaMux2


Attributes
-ocla -ctb -ctbdma


Mnemonic | Signal
CifMemInPbufSel |
CifMemInRmbSel |
CifToRxpMemOutRxp1Sel | cif_rxp_MemOutRxp1Sel_c2a
CifToRxpMemOutRxp2Sel | cif_rxp_MemOutRxp2Sel_c2a
CifToRxpMemOutCopySel | cif_rxp_MemOutCopySel_c2a
CifMemInAlign | cif_xxx_MemInAlign_c4a[3:0]
CifMemInAddr | cif_xxx_MemInAddr_c4a[7:0]

11.14.2.5 DMA Input Collector Mux 3
Class
CtbDmaMux3


Attributes
-ocla -ctb -ctbdma


Bits | Mnemonic | Signal | Description
31 | DmemResultSel | ue_dmem_ResultSel_c5a | Asserted when dmem is written by an instruction
30:21 | DmemResultAddr | ue_xxx_ResultAddr_c5a | Address in dmem where ALU result is written
20:0 | DmemResultData | alu_xxx_ResultDat_c5a[20:0] | ALU result to be written to dmem


11.14.2.6 DMA Input Collector Mux 4, 5, 6, 7 


Collects all-zeros. 


11.15 OCLA in use — PMI 


11.15.1 PMI/PCI/BBS Triggers 

The PMI/PCI/BBS contains two codeword trigger units. The first trigger unit is on its CSW bus stop and the 
second trigger unit is for signals internal to the PMI. 
11.15.1.1 “TrbcPmi” PMI CSW Bus Stop Codeword Triggers 


The CSW side of the PMI is connected to the first TRBC unit, with connections as shown below. This “TrbcPmi”
is “trbc0” in the Verilog source code file PmiOcl.v. Note that for all TRBs, word x (that is, W0, W1, W2, W3)
maps to CodeSampX (CodeSamp0, CodeSamp1, and so on, respectively). W4 and W5 map to the two CodeValid inputs.

No external mux is used; only one set of signals is wired to this trigger block. Field “ExtMuxSel” of
R_TrbcPmiTrigCtl has no effect and can be left unchanged or written to any value.
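
The word-to-input mapping described above can be written down as a small lookup. This is a Python sketch for illustration only; the function name `trbc_input_for_word` is hypothetical, and the assignment of W4 to CodeValid0 and W5 to CodeValid1 is read from the TrbcPmi table rather than stated as a general rule.

```python
def trbc_input_for_word(word):
    """Map a trigger word index (0..5) to the TRBC input it observes."""
    if 0 <= word <= 3:
        # W0..W3 observe the four codeword sample inputs.
        return f"CodeSamp{word}"
    if word in (4, 5):
        # W4 and W5 observe the two CodeValid inputs (assumed order).
        return f"CodeValid{word - 4}"
    raise ValueError(f"no such trigger word: W{word}")
```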


Class
TrbcPmi


Attributes
-ocla -trbc -trbcpmi


Bits | Mnemonic | Codeword Sample Input | Signal | Description
W0[4:0] | CswToPmiCommand | xxx_trbc_CodeSamp0_x0a[4:0] | csw_pmi_Command_c1a | Inbound Command Code from CSW
W1[4:0] | CswToPmiCmdAddrTID | xxx_trbc_CodeSamp1_x0a[4:0] | csw_pmi_CmdAddrTID_c1a | Inbound Request Transaction ID from CSW
W2[4:0] | CswToPmiDataTID | xxx_trbc_CodeSamp2_x0a[4:0] | csw_pmi_DataTID_c3a | Inbound Data Transaction ID from CSW
W3[4] | PmiToCswECmdAddReq | xxx_trbc_CodeSamp3_x0a[4] | pmi_csw_ECmdAddrReq_c0a | Outbound to COHE Command Request from PCI
W3[3] | PmiToCswOCmdAddrReq | xxx_trbc_CodeSamp3_x0a[3] | pmi_csw_OCmdAddrReq_c0a | Outbound to COHO Command Request from PCI
W3[2] | CswToPmiCmdAddrGnt | xxx_trbc_CodeSamp3_x0a[2] | csw_pmi_CmdAddrGnt_c1a | Inbound Command Grant to PCI
W4[0] | Cv0CswToPmiCmdAddrValid | xxx_trbc_CodeValid0_x0a | csw_pmi_CmdAddrValid_c1a | Command/Transfer Valid, CSW is sending cmd to PCI
W5[0] | Cv1CswToPmiDataValid | xxx_trbc_CodeValid1_x0a | csw_pmi_DataValid_c3a | CSW is sending data to PCI


11.15.1.2 “TrbcPmii” PMI Internal Signal Codeword Triggers 


The following PMI internal signals are connected to the second TRBC unit as shown below. This “TrbcPmii”
is “trbc1” in the Verilog source code file PmiOcl.v.

No external mux is used; only one set of signals is wired to this trigger block.

Due to Bug 1959, which affects PMI only in Ice9A, the ExtMuxSel field of R_TrbcPmiiTrigCtl is the mux-select of
input signals for PMI’s CTB! This is fixed in Ice9B.

The value 7 has no special “power-savings” meaning as it does in other units. In PMI it selects a set of signals to
collect. Field ExtMuxSel in R_CtbPmiColCtl does nothing and can be left unchanged or set to any value.


Class
TrbcPmii


Attributes
-ocla -trbc -trbcpmi


Mnemonic | Codeword Sample Input | Signal | Description
SycToCcrRdHdrValid | xxx_trbc_CodeSamp0_x0a[1 | m_RdHdrVal_c1a | Flopped syc_ccr_RdHdrVal_c0a, valid bit for header
CmdInProcess | xxx_trbc_CodeSam | m_CommandInProcess_c1a | A command is being processed
SycToCcwWrHdrValid | xxx_trbc_CodeSam | syc_ccw_WrHdrVal_c0a | tbs (flopped one more time than m_WrSmState_c1a)
 | bc_CodeSam 3:0] | m_WrSmState_c1a[3:0] |
 | xxx_trbc_CodeValid0_x0a | |
 | xxx_trbc_CodeValid1_x0a | |


11.15.2 PMI/PCI/BBS Collector 


The PMI/PCI/BBS contains one 1024 x 33 bit CTB, with an external mux to select sets of signals to collect. 


Due to Bug 1959, which affects PMI only in Ice9A, the ExtMuxSel field of R_TrbcPmiiTrigCtl is the mux-select of
input signals for PMI’s CTB! This is fixed in Ice9B.


The value 7 has no special “power-savings” meaning as it does in other units. In PMI it selects a set of signals to
collect. Field ExtMuxSel in R_CtbPmiColCtl does nothing and can be left unchanged or set to any value.


11.15.2.1 PMI Input Qualifying Triggers 
Class
CtbPmiQtrig


Attributes
-ocla -ctb -ctbpmi


Mnemonic | CTB Input | Description
Qtrig1Always1 | xxx_ctb_QualTrigger1_x0a | Always at '1'
Qtrig0Always1 | xxx_ctb_QualTrigger0_x0a | Always at '1'


11.15.2.2 PMI Input Collector Mux 0 


Class 
CtbPmiMux0 


Attributes 


-ocla -ctb -ctbpmi 


Bits | Mnemonic | CTB Input | Signal | Description
28 | SycToCcrRdHdrVal | xxx_ctb_SampleDataIn0_x0a[28] | m_RdHdrVal_c1a | Flopped syc_ccr_RdHdrVal_c0a, valid bit for header
27:24 | SycToCcrRdLastBe0 | xxx_ctb_SampleDataIn0_x0a[27:24] | m_RdLastBe_c1a[3:0] | Flopped syc_ccr_RdLastBe_c0a[3:0]
23:20 | SycToCcrRdFirstBe0 | xxx_ctb_SampleDataIn0_x0a[23:20] | m_RdFirstBe_c1a[3:0] | Flopped syc_ccr_RdFirstBe_c0a[3:0]
19:10 | SycToCcrRdDwLen0 | xxx_ctb_SampleDataIn0_x0a[19:10] | m_RdDwLen_c1a[9:0] | Flopped syc_ccr_RdDwLen_c0a[9:0], lower bits of 11
9 | HdrPop0 | xxx_ctb_SampleDataIn0_x0a[9] | m_CcrSycRdHdrPop_c2a | Flopped ccr_syc_RdHdrPop_c1a
 | | xxx_ctb_SampleDataIn0_x0a | m_Buf2Busy_c6a |
 | | xxx_ctb_SampleDataIn0_x0a | m_Buf1Busy_c6a |
 | | xxx_ctb_SampleDataIn0_x0a | m_Buf0Busy_c6a |
 | | xxx_ctb_SampleDataIn0_x0a | _Servicing0_c7a |
 | | xxx_ctb_SampleDataIn0_x0a | m_CommandInProgress_c1a |
1:0 | | xxx_ctb_SampleDataIn0_x0a[1:0] | m_RdSmState_c1a[1:0] |


11.15.2.3 PMI Input Collector Mux 1


Class
CtbPmiMux1


Attributes
-ocla -ctb -ctbpmi


Bits | Mnemonic | CTB Input | Signal | Description
 | | | | Write Header Valid
22:13 | | xxx_ctb_SampleDataIn1_x0a[22:13] | syc_ccw_WrDwLen_c0a[9:0] | tbs, the lowest 10 bits of 11-bit WrDwLen
12 | CcwToSycWrHdrPop1 | xxx_ctb_SampleDataIn1_x0a[12] | ccw_syc_WrHdrPop_c1a | tbs (flopped one less time than signals above)
11:5 | CcwWrSeqNum | xxx_ctb_SampleDataIn1_x0a[11:5] | ccw_xxx_WrSeqNum_c1a[6:0] | tbs, the lowest 7 bits of 11-bit WrSeqNum
4 | CmdBusy | xxx_ctb_SampleDataIn1_x0a[4] | m_CmdBusy_c2a | tbs (flopped one less time than signals above)
3:0 | | xxx_ctb_SampleDataIn1_x0a[3:0] | m_WrSmState_c1a[3:0] |

a 
11.15.2.4 PMI Input Collector Mux 2
Class
CtbPmiMux2


Attributes
-ocla -ctb -ctbpmi


CTB Input | Signal | Description
xxx_ctb_SampleDataIn2_x0a | ccw_syc_WrHdrPop_c1a | tbs
xxx_ctb_SampleDataIn2_x0a[26:23] | syc_ccw_WrLastBe_c0a[3:0] |


11.15.2.5 PMI Input Collector Mux 3 
Class 


CtbPmiMux3 


Attributes 


-ocla -ctb -ctbpmi 


CTB Input | Signal | Description
xxx_ctb_SampleDataIn3_x0a[31:27] | | Reserved
xxx_ctb_SampleDataIn3_x0a[26] | syc_ccr_RdHalt_c0a | tbs
xxx_ctb_SampleDataIn3_x0a[25:19] | syc_ccr_RdSeqNum_c0a[6:0] |
xxx_ctb_SampleDataIn3_x0a[18] | syc_ccr_RdHdrVal_c0a |
xxx_ctb_SampleDataIn3_x0a[17:14] | syc_ccr_RdLastBe_c0a[3:0] |
xxx_ctb_SampleDataIn3_x0a[13:10] | syc_ccr_RdFirstBe_c0a[3:0] |
xxx_ctb_SampleDataIn3_x0a[9:0] | syc_ccr_RdDwLen_c0a[9:0] |


11.15.2.6 PMI Input Collector Mux 4 
Class 
CtbPmiMux4 


Attributes 


-ocla -ctb -ctbpmi 


Mnemonic | CTB Input | Signal | Description
CxdToRrfCmd | xxx_ctb_SampleDataIn4_x0a[12:8] | cxd_rrf_Cmd_c4a[4:0] | tbs
CxdToRrfCmdOrigin | xxx_ctb_SampleDataIn4_x0a[7:4] | cxd_rrf_CmdOrigin_c4a[3:0] | tbs
 | xxx_ctb_SampleDataIn4_x0a[3] | | tbs
 | xxx_ctb_SampleDataIn4_x0a[0] | | tbs


11.15.2.7 PMI Input Collector Mux 5 
Class 


CtbPmiMux5 


Attributes 


-ocla -ctb -ctbpmi 


Bits | CTB Input | Signal | Description
31:0 | xxx_ctb_SampleDataIn5_x0a[31:0] | | Reserved, all zeros


11.15.2.8 PMI Input Collector Mux 6 
Class 
CtbPmiMux6 


Attributes 


-ocla -ctb -ctbpmi 


CTB Input | Description
xxx_ctb_SampleDataIn6_x0a[31:25] | tbs


Note 1: 

In the Verilog RTL, cxd_ccm_Dest_c4a is [3:0], but in the behavioral model it’s [4:0]. In the behavioral model
xxx_ctb_SampleDataIn6_x0a[9] is connected to cxd_ccm_Dest_c4a[4], although it should always simulate with this
bit = 0.


11.15.2.9 PMI Input Collector Mux 7 
Class 
CtbPmiMux7 


Attributes 


-ocla -ctb -ctbpmi 
CTB Input | Description
xxx_ctb_SampleDataIn7_x0a | Reserved
xxx_ctb_SampleDataIn7_x0a | UART to PMI Wishbone Ack
xxx_ctb_SampleDataIn7_x0a | PMI to UART Wishbone Strobe


11.16 Register Address Ranges 


11.16.1 TrbcPs0
Register
R_TrbcPs0* : R_Trbcx*

Address
0xE_0C00_0000-0xE_0CFF_FFFF


11.16.2 TrbcPs1
Register
R_TrbcPs1* : R_Trbcx*

Address
0xE_1C00_0000-0xE_1CFF_FFFF


11.16.3 TrbcPs2
Register
R_TrbcPs2* : R_Trbcx*

Address
0xE_2C00_0000-0xE_2CFF_FFFF


11.16.4 TrbcPs3
Register
R_TrbcPs3* : R_Trbcx*

Address
0xE_3C00_0000-0xE_3CFF_FFFF


11.16.5 TrbcPs4
Register
R_TrbcPs4* : R_Trbcx*

Address
0xE_4C00_0000-0xE_4CFF_FFFF


11.16.6 TrbcPs5
Register
R_TrbcPs5* : R_Trbcx*

Address
0xE_5C00_0000-0xE_5CFF_FFFF


11.16.7 TrbcPs6
Register
R_TrbcPs6* : R_Trbcx*

Attributes
-Product=TWC9A+

Address
0xE_4900_0000-0xE_49FF_FFFF


11.16.8 TrbcPs7
Register
R_TrbcPs7* : R_Trbcx*

Attributes
-Product=TWC9A+

Address
0xE_5900_0000-0xE_59FF_FFFF


11.16.9 TrbcPs8
Register
R_TrbcPs8* : R_Trbcx*

Attributes
-Product=TWC9A+

Address
0xE_6900_0000-0xE_69FF_FFFF


11.16.10 TrbcPs9
Register
R_TrbcPs9* : R_Trbcx*

Attributes
-Product=TWC9A+

Address
0xE_7900_0000-0xE_79FF_FFFF


11.16.11 TrbcDma
Register
R_TrbcDma* : R_Trbcx*

Address
0xE_6C00_0000-0xE_6CFF_FFFF


11.16.12 TrbvDma
Register
R_TrbvDma* : R_Trbvx*

Address
0xE_7C00_0000-0xE_7CFF_FFFF


11.16.13 TrbcPmi
Register
R_TrbcPmi* : R_Trbcx*

Address
0xE_0F00_0000-0xE_0FFF_FFFF


11.16.14 TrbcPmii
Register
R_TrbcPmii* : R_Trbcx*

Address
0xE_4F00_0000-0xE_4FFF_FFFF


11.16.15 TrbcCoho
Register
R_TrbcCoho* : R_Trbcx*

Address
0xE_3A00_0000-0xE_3AFF_FFFF


11.16.16 TrbcCohe
Register
R_TrbcCohe* : R_Trbcx*

Address
0xE_2A00_0000-0xE_2AFF_FFFF


11.16.17 TrbvFswo
Register
R_TrbvFswo* : R_Trbvx*

Address
0xE_1F00_0000-0xE_1FFF_FFFF


11.16.18 TrbvFswi
Register
R_TrbvFswi* : R_Trbvx*

Address
0xE_2F00_0000-0xE_2FFF_FFFF


11.16.19 TrbcFsw
Register
R_TrbcFsw* : R_Trbcx*

Address
0xE_3F00_0000-0xE_3FFF_FFFF


11.16.20 CtbPs0
Register
R_CtbPs0* : R_Ctbx*

Address
0xE_0B00_0000-0xE_0BFF_FFFF


11.16.21 CtbPs1
Register
R_CtbPs1* : R_Ctbx*

Address
0xE_1B00_0000-0xE_1BFF_FFFF


11.16.22 CtbPs2
Register
R_CtbPs2* : R_Ctbx*

Address
0xE_2B00_0000-0xE_2BFF_FFFF


11.16.23 CtbPs3
Register
R_CtbPs3* : R_Ctbx*

Address
0xE_3B00_0000-0xE_3BFF_FFFF


11.16.24 CtbPs4
Register
R_CtbPs4* : R_Ctbx*

Address
0xE_4B00_0000-0xE_4BFF_FFFF


11.16.25 CtbPs5
Register
R_CtbPs5* : R_Ctbx*

Address
0xE_5B00_0000-0xE_5BFF_FFFF


11.16.26 CtbPs6
Register
R_CtbPs6* : R_Ctbx*

Attributes
-Product=TWC9A+

Address
0xE_4100_0000-0xE_41FF_FFFF


11.16.27 CtbPs7
Register
R_CtbPs7* : R_Ctbx*

Attributes
-Product=TWC9A+

Address
0xE_5100_0000-0xE_51FF_FFFF




11.16.28 CtbPs8
Register
R_CtbPs8* : R_Ctbx*

Attributes
-Product=TWC9A+

Address
0xE_6100_0000-0xE_61FF_FFFF


11.16.29 CtbPs9
Register
R_CtbPs9* : R_Ctbx*

Attributes
-Product=TWC9A+

Address
0xE_7100_0000-0xE_71FF_FFFF


11.16.30 CtbDma
Register
R_CtbDma* : R_Ctbx*

Address
0xE_6B00_0000-0xE_6BFF_FFFF


11.16.31 CtbPmi
Register
R_CtbPmi* : R_Ctbx*

Address
0xE_7B00_0000-0xE_7BFF_FFFF


11.16.32 CtbCoho
Register
R_CtbCoho* : R_Ctbx*

Address
0xE_1A00_0000-0xE_1AFF_FFFF


11.16.33 CtbCohe
Register
R_CtbCohe* : R_Ctbx*

Address
0xE_0A00_0000-0xE_0AFF_FFFF


11.16.34 CtbFswi
Register
R_CtbFswi* : R_Ctbx*

Address
0xE_4A00_0000-0xE_4AFF_FFFF


11.16.35 CtbFswo
Register
R_CtbFswo* : R_Ctbx*

Address
0xE_5A00_0000-0xE_5AFF_FFFF
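

When scripting against these ranges it can help to turn the underscore-separated notation into integers. The following Python sketch shows one way to do that; the function names and the `RANGES` dictionary are hypothetical, and only three of the ranges from this section are included as samples.

```python
def parse_addr(text):
    """Convert '0xE_0C00_0000' style notation to an integer."""
    return int(text.replace("_", ""), 16)


def parse_range(text):
    """Split 'LOW-HIGH' into an inclusive (low, high) integer pair."""
    low, high = text.split("-")
    return parse_addr(low), parse_addr(high)


# A few sample ranges taken from this section (not the full list).
RANGES = {
    "TrbcPs0": "0xE_0C00_0000-0xE_0CFF_FFFF",
    "TrbvDma": "0xE_7C00_0000-0xE_7CFF_FFFF",
    "CtbDma": "0xE_6B00_0000-0xE_6BFF_FFFF",
}


def block_for(address):
    """Return the name of the sample block containing `address`, or None."""
    for name, text in RANGES.items():
        low, high = parse_range(text)
        if low <= address <= high:
            return name
    return None
```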


11.17 OCLA Programming Suggestions 


11.17.1 Ready-To-Use OCLA Scripts 


Available scripts for using OCLA are documented in: <project>/specs/diags/DiagnosticOCLA.lyx 

Some pre-written OCLA scripts for the diagnostics “dash” environment are in: <project>/diags/ocla_test/

These allow you to use OCLA with a few short commands in simple cases where a per-unit trigger is not needed. 

For easy diagnostics dash control of OCLA, whether using the above-mentioned scripts, or your own configura- 
tion, look in: <project>/diags/ocla/ 


11.17.2 Example Code for OCLA 


For examples of OCLA programming, look at the simulation tests we wrote to verify OCLA. 

Most of the OCLA simulation tests are listed and described on Wiki page: http://apollo.sicortex.com/swiki/OclaVerific

Commands to simulate these tests are (under svn rev control) in: <project>/hw/tests/testlists/ocla_use.vtest

Source code (under svn rev control) is in directory: <project>/sw/anthrax/tests/ocla/ 

Each overall OCLA “program” in this directory requires 2 files and has 3 major parts. For example, test
“ocla_ps3_tlc2q_biuwr” is coded in files ocla_ps3_tlc2q_biuwr.c and ocla_ps3_tlc2q_biuwr_util.cpp. The _util.cpp
file contains 2 parts: the upper part creates the OCLA LAC program, and the lower part defines the values to
write into OCLA configuration registers before the LAC program is run. The .c file is the test, an Anthrax
program to be loaded into PS-0 and PS-3 (in this case), which configures OCLA registers, loads the LAC program,
starts the LAC program, and creates appropriate Ice9 activity so that this particular OCLA configuration and LAC
program will trigger on and collect interesting data.


11.17.3 Use Our Examples on a Real Machine


The OCLA configuration of any <project>/sw/anthrax/tests/ocla/ simulation test can easily be converted
into a diagnostics dash perl script for use on a real machine.

Instructions for converting the OCLA configuration (LAC program plus register configurations) of any of
these simulation tests into a diagnostics dash script are in <project>/sw/anthrax/tests/ocla/README,
and amount to a quick make command. Go to <project>/sw/anthrax/tests, the directory above
the tests, and type “make ocla/<base_name>_cfg.pl”, where <base_name> is the part of the _util.cpp
filename that precedes _util.cpp. The resulting perl script appears in the <project>/sw/anthrax/tests/ocla/
directory.
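
As an illustration of the naming rule, the sketch below derives the make target from a _util.cpp filename. It is a Python sketch with a hypothetical helper name (`ocla_make_target`); it only builds the target string and does not run make. The README in the ocla/ directory remains the authoritative description.

```python
def ocla_make_target(util_filename):
    """Return the make target for a given <base_name>_util.cpp file."""
    suffix = "_util.cpp"
    if not util_filename.endswith(suffix):
        raise ValueError(f"expected a name ending in {suffix}: {util_filename}")
    # <base_name> is everything before the _util.cpp suffix.
    base_name = util_filename[: -len(suffix)]
    return f"ocla/{base_name}_cfg.pl"
```

For the example test named earlier, the resulting target would be typed after `make` from the <project>/sw/anthrax/tests directory.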

If you need an OCLA configuration different from what’s found there, find one of the *_util.cpp files that’s close
to what you need, copy it to a new <something>_util.cpp name, change it to do what you want, and run the make
command to get a dash script.


11.17.4 Create Your Own Counter 


You can use OCLA as 1 or 2 highly-configurable counters. 

In this use of OCLA, Collector Blocks are unused and CollectTrace is never turned on. A LAC program is needed,
but it’s fairly simple for the one-counter case. OCLA’s two counters are in the LAC unit and are incremented by LAC
program instructions.

To create one counter, configure a trigger from any signal or combination of signals, and then write a LAC 
program that has a tight 1-state loop that increments the counter whenever the trigger is asserted. This counter 
has 12-bits in Ice9A, 16-bits in Ice9B. 

To create two counters in Ice9B, configure 2 triggers, and have that 1-state tight loop increment one counter if 
one trigger is true, the other counter if the other trigger is true, and both counters if both triggers are true. 

Creating two counters in Ice9A is less accurate, because Ice9A doesn’t have the INCRBTH instruction, so if 
both triggers are true, you can only increment one of the counters. 

To get one larger counter you can effectively concatenate the two available counters by having nested loops in 
the LAC program. This gives you 24 bits in Ice9A, 32 bits in Ice9B. When you nest the 2 counters there’s a chance 
of tiny inaccuracies in the count because the LAC program has to ignore a potential event when clearing the lower 
counter, each time the lower counter rolls over and increments the higher counter. 
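
The counter widths above can be checked with a little arithmetic. The Python sketch below models the nested (concatenated) counter as a high and a low half; the names `COUNTER_BITS`, `nested_count`, and `max_nested_count` are hypothetical, and the model ignores the small inaccuracy at rollover described above.

```python
# Width of one LAC counter per chip pass, as stated in the text.
COUNTER_BITS = {"Ice9A": 12, "Ice9B": 16}


def nested_count(high, low, chip):
    """Combine the two LAC counters into one value: high is the outer loop."""
    bits = COUNTER_BITS[chip]
    limit = 1 << bits
    if not (0 <= high < limit and 0 <= low < limit):
        raise ValueError("counter value out of range for this chip")
    return (high << bits) | low


def max_nested_count(chip):
    """Largest value the concatenated counter can represent."""
    return (1 << (2 * COUNTER_BITS[chip])) - 1
```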


11.17.4.1 You might prefer SCB Performance Counters 


Because counting in OCLA requires a LAC program, it may be easier to feed the signals or triggers to SCB 
Performance Counters, and do the counting there. SCB Performance Counters are 32 bits whereas OCLA counters 
are only 16 bits (12 bits in Ice9A). 

SCB Performance Counters is pretty powerful. If you wish to count one trigger qualified by another, SCB 
Performance Counters can do that. If you wish to count one trigger qualified by a delayed or advanced version of 
another trigger, SCB Performance Counters can do that, with the delays being applied in OCLA LAC before the 
triggers are sent to SCB Performance Counters. 

One motivation to count in OCLA rather than SCB Performance Counters is that SCB Performance Counters 
has black-out periods (missing counts) whenever an SCB write or read is in progress. 

Another motivation to create a counter in OCLA is if SCB Performance Counters is already in use, or if you 
wanted more than 2 continuously-counting counters. 2 continuous full-count counters in SCB Performance Counters 
plus one in OCLA gives you 3 at once. 

2 in SCB Performance Counters plus 2 in OCLA gives you 4 at once. 


11.17.5 Defensive Programming 


Sometimes when you use OCLA on an Ice9 you don’t know how that OCLA was used previously. State can be
left around that will confuse the results of your OCLA run, or even interfere with its operation! Even the same
OCLA config-and-run done twice in a row can have problems the 2nd time you do it. Don’t rely on reset values of
any OCLA register in LAC, Trigger Blocks, or Collector Blocks. Reset is often long ago, with much history since.

Do one or both of: (a) Before your OCLA run, run an OCLA-config and LAC-program specifically designed to 
clean up everything. (b) Code your OCLA config and LAC program “defensively”, to clean up everything it can in 
the beginning, as it gets started. 

Here’s a list of things to clean up before or during your config and LAC program, with “when to clean it up” in 
parentheses. 


• LAC Flag-0 (early LAC)
• LAC Flag-1 (early LAC)
• External OCLA trigger output pin (early LAC)
• LAC Debug Interrupt (config before)
• LAC Slow Interrupt (config before)
• all LAC Mask and Match registers, used or not (config before)
• every CTB’s EnableCollect (config before)
• every CTB’s Write Address (config before)
• your CTB being stuck-at-full (config before)
• every CTB’s contents (separate OCLA run before)
• CollectTrace (separate OCLA run before, if needed)


All of these could be accomplished by a separate “cleanup” OCLA config-and-run, but most of the cleanup can just 
be included as part of the OCLA config-and-run you are writing for your desired purpose. 

“early LAC” means clearing these in the first few instructions of your OCLA LAC program. 

“config before” means during the SCB-registers configuration you must do to get ready to run your LAC program. 

“separate OCLA run before” means doing a generic “cleanup” OCLA run, involving SCB-registers configuration, 
loading a LAC program, running that LAC program, maybe followed by more register writes. 

Well-written LAC programs, properly manually terminated if they don’t see their trigger, do not leave the
CollectTrace signal ON afterwards. But if it’s ON, you may need or want to shut it OFF.


11.17.6 CTB stuck-at-full 


From trial and error we’ve found it best to write the appropriate R_CtbxColCtl twice for each CTB you are
using; otherwise you risk not collecting anything.

First write: EnableCollect=0, WtAddrClr=1, ExtMuxSel=your_desired_mux_setting, QTrigState and QualTrig 
= your_desired_settings, StopOnFull doesn’t matter. 

Second write: EnableCollect=1, WtAddrClr doesn’t matter, ExtMuxSel=your_desired_mux_setting, QTrigState 
and QualTrig = your_desired_settings, StopOnFull=your_desired_setting. 
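
The two-write sequence can be captured as data before it is turned into actual register writes. The Python sketch below returns the field settings for each write as a dictionary; it does not pack them into a register value, because the bit positions of the R_CtbxColCtl fields are not given in this section. The function name is hypothetical, and the “doesn’t matter” fields are arbitrarily set to 0.

```python
def ctb_colctl_writes(ext_mux_sel, qtrig_state, qual_trig, stop_on_full):
    """Return the field values for the two R_CtbxColCtl writes, in order."""
    common = {
        "ExtMuxSel": ext_mux_sel,
        "QTrigState": qtrig_state,
        "QualTrig": qual_trig,
    }
    # First write: collection disabled, write address cleared.
    first = dict(common, EnableCollect=0, WtAddrClr=1, StopOnFull=0)
    # Second write: collection enabled with the desired StopOnFull setting.
    second = dict(common, EnableCollect=1, WtAddrClr=0, StopOnFull=stop_on_full)
    return [first, second]
```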


11.17.7 Shutting-Off CollectTrace


In Ice9A chips, sometimes lac_ctb_CollectTrace_c2a gets left on. That can cause problems reading CTBs, and 
problems with the next OCLA run. 

This is fixed in Ice9B and later: shutting off the LAC program also shuts off CollectTrace.

CollectTrace can only be turned ON or OFF by a SETCOLL or CLRCOLL opcode in a running LAC program.
In Ice9A there are no register writes which can turn it OFF, although a reset of the chip will turn it OFF.

All LAC programs should make sure to do a CLRCOLL before reaching their final state, or in their final state,
no matter whether they have a “good” termination or a “bad” termination (like a timeout, or user-requested
termination).


11.17.7.1 Why would CollectTrace be Left ON? 


CollectTrace can be left ON due to a bad LAC program, a LAC program with no timeout that never got a 
trigger, or by writing GO=0 to stop a LAC program in the middle, when CollectTrace is still ON. In Ice9A, now 
only a running LAC program that executes opcode CLRCOLL can shut it off! 
11.17.7.2 Why is CollectTrace ON a Problem? 


If CollectTrace is still ON after running a LAC program: 


1. You may get all-zeros when reading out the contents of a CTB (even though the CTB does not contain
all-zeros)! Although misleading and frustrating, this can be solved by clearing that CTB’s EnableCollect bit.


2. As you start configuring for your next OCLA LAC program, some or all of the space in your CTB may
get used up before you can even say GO to your new LAC program! This applies when using a CTB in
StopOnFull mode.
11.17.7.3 Is CollectTrace ON?


To find out if CollectTrace is ON, read bit “Collecting” in the R_CtbxColCtl of any CTB (even if that CTB has
EnableCollect=0, or even if it has StopOnFull=1 and is full).


11.17.7.4 How to Read CTB Contents While CollectTrace is ON 
Clear bit EnableCollect in the CTB’s R_CtbxColCtl, then read the CTB.


11.17.7.5 Fastest Way To Shut Off CollectTrace in Ice9A
1. Write 0x00000000 to R_LacCtl.
2. Write 0x00000000 to all 5 Aggregate Mask Registers, R_LacAggMsk[4:0].
3. Write 0xffffffff to all 5 Aggregate Match Registers, R_LacAggMat[4:0].
4. Write 0x007 to R_LacRam[0x000], R_LacRam[0x400], R_LacRam[0x800], R_LacRam[0xc00]. (This is a tiny
LAC program. There is no need to write or clear the other LAC locations.)
5. Write 0x00000001 to R_LacCtl.
6. Write 0x00000000 to R_LacCtl.


This should clear CollectTrace. 
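
The six steps can be expressed as an ordered list of writes. The Python sketch below only builds that list and replays it through a caller-supplied function; `write_reg` is a hypothetical callback standing in for whatever register-write mechanism is in use (dash, Anthrax, or a simulator), and the (name, index, value) tuple format is an assumption made for illustration.

```python
def collecttrace_shutoff_writes():
    """Return the Ice9A CollectTrace shut-off sequence as (name, index, value)."""
    writes = [("R_LacCtl", None, 0x00000000)]                        # step 1
    writes += [("R_LacAggMsk", i, 0x00000000) for i in range(5)]     # step 2
    writes += [("R_LacAggMat", i, 0xFFFFFFFF) for i in range(5)]     # step 3
    writes += [("R_LacRam", addr, 0x007)                             # step 4
               for addr in (0x000, 0x400, 0x800, 0xC00)]
    writes.append(("R_LacCtl", None, 0x00000001))                    # step 5
    writes.append(("R_LacCtl", None, 0x00000000))                    # step 6
    return writes


def apply_writes(writes, write_reg):
    """Replay the sequence in order through a caller-supplied write function."""
    for name, index, value in writes:
        write_reg(name, index, value)
```

Keeping the sequence as data makes it easy to log the writes, or to save and restore the prior register values around it.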

Of course you’ve now slightly messed-up your previous LAC program and previous OCLA registers configuration. 
You can either try to restore the changed values or load a complete new configuration and program for OCLA. 

To restore: Prior values of R_LacCtl, R_LacAggMsk[4:0], and R_LacAggMat[4:0] could be read and remembered
ahead of time, then restored afterwards. R_LacRam is write-only, so to know what values to restore to it you’ll
have to read your LAC-program source code, or look at a logfile.

How this shuts off CollectTrace:

The value 0x007 in R_LacRam[0x000] means {CLRCOLL, GO TO State-0}. The instruction in R_LacRam[0x000]
is executed after the write of 0x00000001 to R_LacCtl, and then CollectTrace will be OFF.

The other writes keep CollectTrace OFF during the time it takes to write 0x00000000 to R_LacCtl. Many
LAC steps may get executed during that time. State-0 has 128 locations in LacRam, selected by the Aggregate Match
and counter overflow bits. The writes to the Aggregate Mask and Match registers zero out the Aggregate Match
bits, reducing State-0 to only 4 locations selected by the counter overflow bits. With 0x007 in all 4 of those locations
we stay within those 4 locations and do not start executing other instructions of the previous LAC program (which
might contain a SETCOLL).
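
The location counts in this explanation follow from simple bit arithmetic. The Python sketch below reproduces them under three assumptions that are inferred rather than stated in this section: each of the 5 Aggregate Match registers contributes one address bit, each of the two LAC counters contributes one overflow bit, and the overflow bits sit at LacRam address bits 11:10 (suggested by the 0x400 spacing of the four writes).

```python
AGG_MATCH_BITS = 5    # assumed: one bit per Aggregate Match register
OVERFLOW_BITS = 2     # assumed: one overflow bit per LAC counter
OVERFLOW_SHIFT = 10   # inferred from the 0x400 spacing of the four writes


def state0_location_count(agg_match_forced_zero):
    """Number of LacRam locations State-0 can select."""
    free_bits = OVERFLOW_BITS
    if not agg_match_forced_zero:
        free_bits += AGG_MATCH_BITS
    return 1 << free_bits


def state0_locations_with_zero_match():
    """State-0 addresses that remain once the Aggregate Match bits are zero."""
    return [ovf << OVERFLOW_SHIFT for ovf in range(1 << OVERFLOW_BITS)]
```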


12.1 Overview


This section describes the “miscellaneous” pieces of the ICE9 chip. These include: 


Clock generation and distribution 

ECC general description 

The Design For Test (DFT) support for internal test scan and boundary scan at manufacturing.¹
Reset and related logic 


Boot time-line 


12.2 Differences, Bugs, and Enhancements 


12.2.1 Product and Chip Pass Differences 


1. ICE9A1 returns a different revision (ICE9A1 vs ICE9A0) when reading the IDCODE register. 

2. ICE9B fixes Sms Reset synchronized to the wrong clock, bug2055. The bug required the smsclock to be turned off 
whenever we wiggle reset, then turned on again a bit later. 

3. ICE9B eliminates R_SysTapDint, replaced with the SCB-space R_ScbDInt, bug2223. 

4. ICE9B supports transmit interrupts for R_SysTapAtnMsp, and separates RW1C bits, bug2222. 

5. NEED IMPL: TWC9A changes the default value for R-SysTapPILD*clkDifv to support a processor default 
clock frequency of *FIX* MHz, bug3384. 

6. TWC9A fixes access to any SCB bus slave hanging while the DDR controller is in reset, bug2928. 

7. TWC9A adds an R_SysTapReset_Lac and _Pmi to separate the R_SysTapReset_Scb bit from also controlling 
the BBS/PMI reset, bug2929. Earlier products needed caution when maintaining FSW/FL traffic during 
partial reboots. 

8. NEED IMPL: TWC9A adds R_SysTapReset_Proc6, and _ProcSms6 to support the additional cores. 

9. TWC9A uses R_SysTapInstrTwe instead of R_SysTapInstReg to support the additional cores. 


10. TWC9A adds R_SysTapScb64 to access doubleword SCB registers. Code should use this new register, or 64-bit 
SCB registers will not be visible. 

11. NEED IMPL: TWC9A adds the R_SysMemInit register and associated functions for on-chip memory initialization. 
In previous products BIST was used to initialize on-chip memories. 

12. NEED IMPL+SPEC: TWC9A will merge the SysChain and E-Silicon chain on-chip instead of off-chip. 

13. NEED IMPL+SPEC: TWC9A will replace the E-Silicon chain or make it IEEE compliant (on the correct edges). 

[1] See also the reference document “IEEE Standard Test Access Port and Boundary-Scan Architecture”, IEEE Std 
1149.1-2001, IEEE Joint Test Action Group (JTAG). 


12.2.2 Known Bugs and Possible Enhancements 


1. [Larry] Add a new LBS+SCB region. The msp could set the start address in 32 or 64 bit steps, and then scan 
in, say, 128 bytes with a continuous shift on the scan. Then, while the ice9 digests that block, the msp scans 
128 bytes into the alternate half of the block. This is essentially a block of shared memory accessed on the 
ice9 side by scb and on the msp side by efficient scan. The scan chain would shift in a direction compatible 
with the qspi as well. This shared area would be used instead of fastdata (since it would be much faster) for 
boot2 loading, and we would also use it for block transfers of attn data instead of doing that 26 bits at a time 
via the current attn register. 


12.3 Clock generation and distribution 


12.3.1 Goals and Features 


The SiCortex system clock architecture (including specifics of board design) has the following goals: 

1. The system clock architecture has one system clock (sys_clk) and each board receives a copy of the sys_clk. 
The system clock architecture will minimize the possibility of a single point of failure in the clock tree. 

2. The distribution frequency of the system clock (sys_clk) will be 66.67 MHz, with a long-term accuracy of 
100 ppm and a jitter spec of +/- 50 ps. 

3. Each ICE9 chip will generate on-chip clocks for its sub-systems using 2 (differential) copies of sys_clk 
(sys_clk_e_h/l & sys_clk_o_h/l). Thus all generated clocks in the system will be derived from a single oscillator. 

4. The inter-ICE9 fabric is a “Mesochronous Interconnect” where each node in the fabric is frequency locked 
(but not phase locked) with every other node. 

5. The fabric switch operating speed is targeted at 200 MHz. Correspondingly, the fabric link will operate at an 
8B10B encoded data rate of ten times the operating speed of the fabric switch. The PLL design will allow 
adjusting the fabric switch clock speed by up to +/- 25% from its design goal. 

6. The primary design goal of the processor/cache operating speed is 500 MHz/250 MHz. The PLL design will 
allow selecting the processor/cache clock speed by as much as +/- 20% from its design goal. 

7. The primary design goal of the DDR2 interface is to operate with industry standard SDRAM DIMMs. The 
industry standard SDRAMs are (will be) available at 200/266/333/400 MHz clock speeds. The PLL design 
will allow DDR2 clock speed selection from 200 MHz to 400 MHz. 

8. The primary design goal of the PCIe root complex and PCIe controller is to use a clock at 250 MHz. The primary 
design goal of the PCIe PHY is to use a RefClk clock at 125 MHz. These clocks come from the PCI Express PHY. 
The PLL design will generate a PCI reference clock at 100 MHz for use by the PHY and to be driven off-chip 
for use by an attached card. 

9. The PLL design will allow configuring each PLL in BYPASS mode. (See the test clock discussion in Section 
2722.) 


Clock generator features of ICE9 are listed below: 

1. ICE9 clock domains can be categorized into four clock groups as follows: 

(a) Group-A: For fabric switch and fabric links, sclk from 200 MHz to 250 MHz. 

(b) Group-B: Processor/cache clocks, pclk/cclk, maintaining a phase-aligned 2:1 frequency ratio between pclk and 
cclk. The range of pclk is from 400 MHz to 800 MHz. 

(c) Group-C: DDR2 clocks, dclk. This group will need dclk and dclk90. The operating range of dclk is from 
250 MHz to 400 MHz. Operating values are 200, 267, 333, and 400 MHz. Each of the two DDR2 
interfaces has its own PLL to generate the in-phase and quadrature clocks: d1clk & d1clk_m90 and 
d0clk & d0clk_m90. 

(d) Group-D: PCIe interface, pci_ref_clk/pci_ref_clk_x2 at 100 MHz and 200 MHz, maintaining a 1:2 frequency 
ratio; phase alignment is not necessary. 


2. ICE9 will use one PLL design (called PLL_AB) to generate clocks for various sub-systems. 
The PLL_AB design has two outputs. The relationship between the two outputs is configurable from three 
choices. The output selection choices are: 

(a) DIV2-0deg, DIV4-0deg : factor of 2 frequency difference, outputs are phase aligned. 
(b) DIV4-0deg, DIV4-90deg : same frequency, 90 degree phase shift between outputs. 
(c) DIV4-0deg, DIV8-0deg : factor of 2 frequency difference, outputs are phase aligned. 


3. ICE9 has a total of five instances of the PLL_AB design. The 5 PLLs are placed in 2 groups: one near the 
south-west (odd-link) corner of the chip and one near the north-east (even-pci) corner of the chip. The east-side 
PLL group contains the pclk/cclk PLL, the pci_ref_clk PLL and the d0clk/d0clk_m90 PLL. The west-side 
PLL group contains the sclk PLL and the d1clk/d1clk_m90 PLL. 

4. ICE9 will get 2 copies of the differential sys_clk on 4 reference-clock input pins. The “RefClk” pin of each of the 5 
instances of the PLL_AB will be connected to the sys_clk nearest it. 


12.3.2 Sys_clk distribution tree 


The SiCortex system will use a backplane and connectors as the inter-board connection medium. The backplane 
will not have any active components. Boards make signal connections to each other through their connectors on the 
backplane. 

In the chassis, the clock distribution tree originates at an oscillator operating at 133.33 MHz. The oscillator 
output is divided by 2 and then distributed as “sys_clk” at 66.67 MHz to all boards. On each board, a copy of sys_clk 
is connected to the 2 “sys_clk” inputs of each of the 27 ICE9 chips. Because all copies of sys_clk in the chassis 
originate from a single oscillator, all generated clocks in ICE9 are frequency locked with respect to each other. The sys_clk 
input to ICE9 is on 2 distinct pairs of LVDS pins received in 2 LVDS receivers: one for the southwest PLL group 
and one for the northeast PLL group. The board-level sys_clk distribution tree has 54 sys_clk destinations on each 
module board (2 for each ICE9 chip). 

The system clock distribution scheme is shown in Figure 12.1. 

Figure 12.1 shows three connectors, M, N, and P, each receiving a copy of sysclk and driving a buffered version of 
the sys_clk to 27 ICE9 chips with 2 receiver ports each. The on-chip clock generation in ICE9 consists of 5 instances 
of PLL_AB. 

Figure 12.2 shows that in ICE9, the clock PLLs generate clocks for the Processor/L2-cache, DDR2 interface, PCIe 
interface, and the fabric switch. The fabric switch clock, in conjunction with multiple fabric link receiver PLLs 
and multiple fabric link transmitter PLLs, builds the complete clocking scheme of the fabric links. The fabric link 
clocking is described below. A similar strategy is employed for the PCIe SERDES links. 

Each fabric link connects two logically adjacent ICE9 chips using SERDES PHY technology, which drives embedded 
clock and data on a differential pair of wires. The sclk PLL generates the clock for the fabric switch, which is 
also used by the link transmitter PHY. The fabric link transmitter PHY has a PLL, called Tx_PLL (an 
integral part of the link PHY), which uses the fabric switch clock signal as a reference clock and drives a serial data 
stream on the transmitter PHY port five times faster than the switch clock in DDR mode. The fabric link receiver 
also has a clock-data-recovery PLL (CDR-PLL), also integral to the link PHY and dedicated to the receive lane, 
to recover data and clock from incoming data streams. 
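
The relationship between the switch clock and the serial line rate can be checked with a little arithmetic. The sketch below assumes that "five times faster than the switch clock in DDR mode" means a Tx_PLL clock of 5 x sclk with two bits transferred per cycle, which matches the ten-times figure given in the goals. The function names are illustrative only and are not part of the ICE9 design.

```python
# Illustrative arithmetic only; the function names are hypothetical.

def link_line_rate_mbps(sclk_mhz, pll_multiplier=5, bits_per_cycle=2):
    """Serial line rate in Mbit/s: Tx_PLL at 5 x sclk, two bits per cycle (DDR)."""
    return sclk_mhz * pll_multiplier * bits_per_cycle


def link_payload_rate_mbps(sclk_mhz):
    """Payload rate after 8B10B coding (8 data bits carried per 10 line bits)."""
    return link_line_rate_mbps(sclk_mhz) * 8 / 10


for sclk in (200, 250):
    print(sclk, link_line_rate_mbps(sclk), link_payload_rate_mbps(sclk))
```

At the 200 MHz design goal this gives a 2 Gbit/s line rate per serial stream, of which 1.6 Gbit/s is payload after 8B10B coding.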

In Figure 12.1, five instances of the PLL_AB use sysclk at 66.67 MHz as a reference clock, which is sourced 
from a single oscillator; hence, all generated clocks operate frequency locked with respect to each other. 

Figure 12.1: Clock Tree Distribution 

Note: Jitter spec (estimated, needs validation) of sysclk at the pins of ICE9. Elements listed in the figure: 
OSC: 133.33 MHz; Divider: DIV2; BUF: 36_FO; BUF: 27_FO; sysclk @ ICE9. 

12.3.3 Clock Generation in ICE9 


The clock generation for ICE9 takes place in 2 physically distinct PLL groups. For logical purposes these may 
be treated as a single module, though the chip hierarchy will include them as separate entities. The logical clock 
generation module is shown in Figure 12.2. It has five instances of the PLL_AB and it generates sclk for the fabric 
switch interface, pclk/cclk for the processor core and L2-cache interface, separate dclks for the DDR2 controllers, 
and pci_ref_clk for the PCIe interfaces. Each instance of the PLL_AB has several control signals, described below. 
There are two instances of the PLL_AB for generation of the two dclks. Each dclk domain (d0clk & d1clk) will be 
provided with a “normal” clock signal (used for the majority of the logic) and a -90-degree phase clock (used only 
in the PHY). 


12.3.4 PCIe clocking 


The clocking scheme for the PCI Express interface has changed from the original plan. The PLL originally 
planned to generate the 250 MHz iclk will now generate a 100 MHz pci_ref_clk from the 66.67 MHz sys_clk. The 
100 MHz pci_ref_clk will then be driven off-chip to the clock pin of the PCIe slot on the module board (perhaps 
through a buffer or level translator). It will also be driven to the PCIe PHY, where it will be used to generate the 
250 MHz iclk (and internally to clock the SERDES transmitters). The result is that the root of the iclk tree will 
now be an output pin of the PCIe PHY. 

Note that the PCI Express specification allows the reference clock frequency to be “downspread” by up to 0.5%, 
to allow spread-spectrum clocking for radio-frequency emissions control purposes. The system design may take 
advantage of this by using more widely available 133 MHz oscillators, resulting in a 66.5 MHz sys_clk frequency, 
0.25% below nominal. This works because both ends of all our PCI Express links will use the same reference clock 
as just described. 

Figure 12.2: ICE9 Clocks and Data Rates 

[Figure 12.2 labels recoverable from the source: sysclk @ 66.67 MHz feeding Pllsw/Pllne (instances of PLL_AB in 
2 groups); sclk @ 200/250 MHz (switch fabric); cclk (pclk/2) @ 200 - 400 MHz (L2 cache); dclk @ 266/333/400 MHz 
(DDR2 controller); iclk @ 250 MHz; pcirefclkx2/pcirefclk. PLL control inputs: PLL Multiplier[3:0] and Clock Rate 
Selector[1:0] for each of d0/1clk, iclk, pclk, and sclk. Data rates shown: PCI link 2.5 Gbytes/sec (unidirectional) 
and 5 Gbytes/sec (full duplex); DDR2 8.5/10.6/12.8 Gbytes/sec (without ECC); Tx-link and Rx-link data rate 2 GBS, 
with 1.6 Gbytes/sec and 3.2 Gbytes/sec (full duplex).] 
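
The downspread argument can be checked numerically. The following is a minimal sketch using only figures from the text (133.33 MHz nominal oscillator, divide-by-2 to sys_clk, 100 MHz pci_ref_clk from 66.67 MHz sys_clk, 0.5% downspread allowance). The constant and function names are hypothetical, and the 3/2 ratio is simply 100 MHz divided by 66.67 MHz.

```python
# Illustrative check; names are hypothetical, figures come from the text.

NOMINAL_OSC_MHZ = 133.33      # chassis oscillator (Section 12.3.2)
ALT_OSC_MHZ = 133.0           # more widely available part
PCIE_MAX_DOWNSPREAD = 0.005   # 0.5% allowed by the PCI Express specification


def sys_clk_mhz(osc_mhz):
    """sys_clk is the oscillator output divided by 2."""
    return osc_mhz / 2


def pci_ref_clk_mhz(osc_mhz):
    """pci_ref_clk is sys_clk scaled by 3/2 (100 MHz from 66.67 MHz)."""
    return sys_clk_mhz(osc_mhz) * 3 / 2


def fractional_offset(osc_mhz):
    """Fraction by which every derived clock sits below nominal."""
    return (NOMINAL_OSC_MHZ - osc_mhz) / NOMINAL_OSC_MHZ


offset = fractional_offset(ALT_OSC_MHZ)
print(f"sys_clk = {sys_clk_mhz(ALT_OSC_MHZ):.2f} MHz")
print(f"pci_ref_clk = {pci_ref_clk_mhz(ALT_OSC_MHZ):.2f} MHz")
print(f"offset = {offset * 100:.2f}% (limit {PCIE_MAX_DOWNSPREAD * 100:.1f}%)")
```

Because every on-chip clock is derived from the same oscillator, the same fractional offset applies to all of them, not only to pci_ref_clk.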


12.3.5 Block diagram of PLL_AB 


The block diagram of the PLL_AB is shown in Figure 12.3 and the pins are listed in Table 12.1. 

The PLL receives the REF input as its reference clock and its VCO multiplier factor through the DIVF signal. 
The PLL_LOCK signal is a status signal which will be set when the PLL has acquired lock. The PLL can be held in 
the reset state by the RESET signal. 

There are 2 outputs from PLL_AB: PLLOUT_1 and PLLOUT_2. Both outputs from the PLL are configurable 
through the OUTPUT_SEL signal. There are 3 choices of output selection. 

The PLL_AB also supports a bypass mode, entered when the BYPASS_ENAB signal is set. In bypass mode, there 
are 2 options available for selecting BYPASS_CLK at the two output ports: (a) both outputs are connected to 
BYPASS_CLK, and (b) one of the outputs is connected to the half-frequency clock of BYPASS_CLK. 


ICE9 PLL Instantiation & Configuration Notes: 


1. The RESET signal for the PLL_AB must be gated with a decode of {test_mode_en, test_mode[*]} to ensure 
it is asserted in the appropriate scan modes. 


2. All pins (including REF signal) of PLL_AB are regular core-voltage CMOS signals. 


3. Control signals for the PLL_ABs which are CSRs must be explicitly registered on the appropriate chain. The 
PLL macro does not register the bits internally. 


4. Any changes to CSR bits affecting PLL operation should be appropriately guarded by reset for both the PLL 
and downstream (clocked) logic to prevent deleterious effects due to unstable PLL operation, clock glitches, 
runt pulses, etc. 


5. Invalid settings: When DIVF[4:0] is less than 5’d11 or greater than 5’d23, or OUTPUT_SEL[1:0] equals 
2’d3, no damage will occur to the PLL, but the output behavior is not defined. 


Signal | From | To | Description 
REF | | PLL_AB | Reference clock at 66.67 MHz 
RESET | SysChain | PLL_AB | PLL internal reset. This signal is gated with scan_enable and stays asserted during chip reset. 
DIVF[4:0] | SysChain | PLL_AB | VCO feedback divider. Encodings of 5’d11 through 5’d23 will provide multipliers from 12 to 24. Multiplier value = (DIVF[4:0] + 1) 
OUTPUT_SEL[1:0] | SysChain | PLL_AB | PLL output selector for [PLLOUT_1, PLLOUT_2]. The selector encodings are: 0 - DIV2, DIV4 (both outputs are phase aligned); 1 - DIV4, DIV4-90; 2 - DIV4, DIV8 
BYPASS_ENA | | PLL_AB | PLL bypass enable 
BYPASS_CLK1 | | PLL_AB | Bypass clock when BYPASS_ENA is set 
BYPASS_CLK0 | | PLL_AB | Bypass clock when BYPASS_ENA is set 
BYPASS_CLK_SEL | SysChain | PLL_AB | Selects BYPASS_CLK0 or BYPASS_CLK1 for output when BYPASS_ENA asserts 
LOCK | PLL_AB | SysChain | PLL lock indicator 
PLLOUT_1 | PLL_AB | clock-tree | PLL_1 output. This signal has 50% duty cycle in normal mode. Refer to encodings of OUTPUT_SEL[1:0] 
PLLOUT_2 | PLL_AB | clock-tree | PLL_2 output. This signal has 50% duty cycle in normal mode. Refer to description of OUTPUT_SEL[1:0] 
 | | | Analog power and ground pads (chip bumps) 
VDD/VSS | I/O ring | PLL_AB | Core power/ground, connect by abutment in the I/O ring 


Table 12.1: PLL_AB Pins 


[Figure 12.3 shows the PLL_AB block with inputs BYPASS_ENA, BYPASS_CLK_SEL, BYPASS_CLK1, BYPASS_CLK0, RESET, 
DIVF[4:0], and OUTPUT_SEL[1:0], and outputs including PLLOUT_2.] 


Figure 12.3: PLL_AB Block Diagram 


[Table 12.2, partially recoverable. Columns: BYPASS_ENAB, RESET, BYPASS_CLK_SEL, PLL_LOCK, PLLOUT_1, PLLOUT_2. 
With BYPASS_ENAB = 0 and RESET = 0, PLL_LOCK, PLLOUT_1, and PLLOUT_2 are in normal mode. The bypass rows 
select BYPASS_CLK0 or BYPASS_CLK1 for the outputs.] 


Table 12.2: PLL Bypass Control 


6. The divider flops, including “DIV2” flop on BYPASS_CLK path, are not scannable. If they do not work it 
will become apparent when no clock is observed. 


12.3.5.1 Bypass mode in PLL_AB 


Each PLL_AB has three primary pins to support bypassing PLL. Those pins are BYPASS_ENAB, BYPASS_DIV2_ENAB, 
and BYPASS_CLK. The output of the PLL_AB will be selected as per Table 12.2. The pins are driven by the test 
mode controller based on the state of the test mode pins described in Table 12.4 and by the SysChain scan control 
chain that is used by the module service processor to initialize and configure the ICE9 chip. (See Section 12.6.9.) 


12.3.6 Implementation of PLL_AB 


ICE9 will have five instances of PLL_AB to generate the primary clocks: sclk, pclk, dclk, and pci_ref_clk. (There 
are 2 instances of the DDR clock PLLs for the 2 dclk domains.) The clock implementation is shown in Figure 12.4. 
The implementation scheme provides a range of operating speeds for each clock by varying the DIVF[4:0] input. 

Valid settings and the range of clock outputs for those settings are shown in Table 12.3. 

Note that the first row identifies the clock name and the value of the OUTPUT_SEL[1:0] pins in brackets. This register 
is controlled via the SysChain scan registers described in Section 12.6.9. 
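
The range of output frequencies follows directly from the DIVF and OUTPUT_SEL encodings in Table 12.1. The sketch below is an illustration under two assumptions that the text does not state outright: that the VCO runs at the reference frequency times (DIVF + 1), and that each output is the VCO frequency divided by the DIVn factor named in the OUTPUT_SEL encoding. The example settings at the bottom were chosen here to reproduce the design-goal frequencies; they are not taken from Table 12.3.

```python
# Sketch of PLL_AB output arithmetic. Assumes each output is the VCO
# frequency divided by the DIVn factor named in the OUTPUT_SEL encoding.

REF_MHZ = 200.0 / 3.0   # 66.67 MHz sys_clk

# OUTPUT_SEL -> (PLLOUT_1 divider, PLLOUT_2 divider, PLLOUT_2 phase in degrees)
OUTPUT_SEL = {
    0: (2, 4, 0),    # DIV2, DIV4, phase aligned
    1: (4, 4, 90),   # DIV4, DIV4-90
    2: (4, 8, 0),    # DIV4, DIV8, phase aligned
}


def pll_outputs(divf, output_sel, ref_mhz=REF_MHZ):
    """Return (vco_mhz, pllout_1_mhz, pllout_2_mhz, phase_deg)."""
    if not 11 <= divf <= 23:
        raise ValueError("DIVF[4:0] outside 11..23: output undefined")
    if output_sel not in OUTPUT_SEL:
        raise ValueError("OUTPUT_SEL[1:0] = 3: output undefined")
    div1, div2, phase = OUTPUT_SEL[output_sel]
    vco = ref_mhz * (divf + 1)      # multiplier value = DIVF + 1
    return vco, vco / div1, vco / div2, phase


# Hypothetical example settings that reproduce the design-goal frequencies.
EXAMPLES = [
    ("pclk/cclk", 14, 0),
    ("dclk/dclk_m90", 19, 1),
    ("pci_ref_clk_x2/pci_ref_clk", 11, 2),
]
for name, divf, sel in EXAMPLES:
    vco, out1, out2, phase = pll_outputs(divf, sel)
    print(f"{name}: VCO {vco:.0f} MHz, {out1:.1f}/{out2:.1f} MHz, phase {phase}")
```

With the valid DIVF range of 11 through 23, this model puts the VCO between 800 MHz and 1600 MHz.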

The 5 PLLs are placed on the chip in 2 groups: Pllsw & Pllne. Pllsw contains an LVDS sys_clk receiver, PLLs 
for d1clk/d1clk90 & sclk, and an LVDS driver for test_clk_o_h/l. Pllne contains an LVDS sys_clk receiver, 
PLLs for d0clk/d0clk90, pci_ref_clk, & pclk/cclk, and an LVDS driver for pci_ref_clk_l/h. 
The test_clk_o_h/l and the pci_ref_clk_h/l LVDS output pins are driven through muxes to select several 
operational and test clocks as indicated in Section 12.6.9. 


[Figure 12.4 shows four PLL_AB configurations, each with SYSCLK on REF and the RESET, DIVF, OUTPUT_SEL, 
BYPASS_ENA, BYPASS_CLK_SEL, BYPASS_CLK0, and BYPASS_CLK1 inputs: PLL_sclk (200 MHz), PLL_pclk (500/250 MHz), 
PLL_dclk (333 MHz; 2 instances, d0 & d1, with PLLOUT_2 driving d?clk90), and PLL_iclk (200/100 MHz, with PLLOUT_2 
driving pcirefclk).] 


Figure 12.4: Clocks using PLL_AB 


Table 12.3: PLL VCO Scaling Factors 


12.4 General ECC strategy 


This section on ECC strategy describes general guidelines for implementation of ECC on the ice9 chip. Specifics 
of how the ECC is implemented in any given section are described in the appropriate chapter of this spec. 

The following registers should be implemented by memories which have ECC generation and/or checking. All 
of these registers are read/write master/slave registers on the SCB (or other software-visible bus/chain). Access 
to SCB registers and operation of the SCB are described in the “Serial Configuration Bus” chapter of the chip spec. 
The specific names of these registers are documented with each section’s SCB registers. 


Control Registers | Status Registers 
ECC_Mode_Register[1:0] | ECC_Error_Status_Register[2:0] 
ECC_Drive_Bad_Data_Register[1:0] | ECC_Error_Address_Register[x:0] (not all cases, see below) 
 | ECC_Error_Syndrom_Register[7:0] (not all cases, see below) 

ECC handling for the L1 caches (I & D) has been modified to leverage the existing parity and interrupt 
mechanisms in the M5Kf processor core and is therefore somewhat different than described here. The L1 I-cache 
treats a parity error as a miss, which causes a fetch from the (ECC protected) L2 cache. This effectively provides 
single-bit-error correction but not double-bit-error detection. The L1 D-cache implements byte-wide ECC to support 
byte writes. See the Processor Segments chapter for more details. 


12.4.1 ECC Control Register Descriptions 
12.4.1.1 ECC_Mode_Register[1:0] (associated with ECC correction) 


ECC_Mode_Register[1] - ECC error detection enable: Enables writing of ECC status registers and assertion of the 
ECC interrupt line from this block. 


ECC_Mode_Register[0] - ECC error correction enable: Enables ECC correction of data passing through the 
correction block. 


12.4.1.2 ECC_Drive_Bad_Data_Register[1:0] (associated with ECC generation) 

ECC_Drive_Bad_Data_Register[1] - flip bit [1] of the data coming out of the ECC generator (into the storage array) 
ECC_Drive_Bad_Data_Register[0] - flip bit [0] of the data coming out of the ECC generator (into the storage array) 
Asserting either causes a single-bit error to be generated. Asserting both causes a double-bit error to be generated. 


Note: 


- In most cases, “ECC_Drive_Bad_Data_Register” applies to all writes after the bit(s) are set, relying on software 
restrictions (i.e., clearing the register bit at an appropriate time) to ensure that reasonable behavior is obtained 
during software testing. 


- If convenient, “ECC_Drive_Bad_Data_Register” MAY be implemented as a single-cycle operation (i.e., only 
the first write after asserting bits in the register contains bad data; then the register bit is cleared & subsequent 
writes return to normal operation). 
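
The effect of the two ECC_Drive_Bad_Data_Register bits can be illustrated with a toy single-error-correct, double-error-detect code. The sketch below uses an extended Hamming (8,4) code purely as a stand-in; the actual ECC code, word width, and bit mapping used by ICE9 memories are not described in this section, and all function names are hypothetical.

```python
# Toy SECDED demonstration: extended Hamming (8,4). This is a stand-in,
# not the ECC code used by ICE9. All names are hypothetical.

DATA_POS = (3, 5, 6, 7)   # Hamming positions that hold data bits 0..3


def encode(nibble):
    """Return an 8-entry codeword: code[1..7] Hamming, code[0] overall parity."""
    code = [0] * 8
    for bit, pos in enumerate(DATA_POS):
        code[pos] = (nibble >> bit) & 1
    for p in (1, 2, 4):   # each parity bit makes its group even
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def write_with_bad_data(nibble, drive_bad_data):
    """Model the generator output with data bit 0 and/or 1 flipped."""
    code = encode(nibble)
    for bit in (0, 1):
        if (drive_bad_data >> bit) & 1:
            code[DATA_POS[bit]] ^= 1
    return code


def check(code):
    """Classify a stored word the way a SECDED checker would."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return "no error"
    if overall == 1:
        return "single-bit error (correctable)"
    return "double-bit error (detected, not correctable)"


for setting in range(4):
    result = check(write_with_bad_data(0b1010, setting))
    print(f"Drive_Bad_Data = {setting:02b}: {result}")
```

Setting one bit flips a single stored bit, which the checker classifies as correctable; setting both flips two bits, which it can detect but not correct.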


12.4.2 ECC Status Register Descriptions 
12.4.2.1 ECC_Error_Status_Register[2:0] (associated with ECC correction) 


ECC_Error_Status_Register[2] - sets if more than one ECC error occurs, i.e., if (ECC_Event_Occurs && ECC_Mode_Register[1] 
&& (ECC_Error_Status_Register[1] || ECC_Error_Status_Register[0])) => set ECC_Error_Status_Register[2] 


ECC_Error_Status_Register[1] - sets if an ECC-correctable error is detected 


ECC_Error_Status_Register[0] - sets if a non-correctable ECC error is detected 

Note: 


- Updates of ECC_Error_Status_Register due to ECC errors are blocked if ECC_Mode_Register[1] is deasserted. 
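
The status-bit rules above can be expressed as a small behavioral model. This is a sketch for illustration only: the class and method names are hypothetical, and the `clear` method assumes software simply zeroes the register, which this section does not specify.

```python
# Behavioral sketch of the status-bit rules; names are hypothetical.

class EccStatusModel:
    """Toy model of ECC_Mode_Register[1:0] and ECC_Error_Status_Register[2:0]."""

    def __init__(self):
        self.mode = 0b00      # [1] detection enable, [0] correction enable
        self.status = 0b000   # [2] multiple, [1] correctable, [0] non-correctable

    def ecc_event(self, correctable):
        """Record one ECC event as seen by the checker."""
        if not (self.mode >> 1) & 1:
            return                    # updates blocked when detection is off
        if self.status & 0b011:
            self.status |= 0b100      # an earlier error is still recorded
        self.status |= 0b010 if correctable else 0b001

    def clear(self):
        """Assumed software clear, done in the ECC interrupt routine."""
        self.status = 0


model = EccStatusModel()
model.ecc_event(correctable=True)     # ignored: detection is disabled
model.mode = 0b10
model.ecc_event(correctable=True)     # sets bit [1]
model.ecc_event(correctable=False)    # sets bits [2] and [0]
print(f"status = {model.status:03b}")
```

Bit [2] is evaluated before the new error is recorded, so the first error after a clear never sets it.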


For ECC correctors on the path to/from main memory (i.e., coming on/off the CSW), the following 
2 registers may also be required: 


12.4.2.2 ECC_Error_Address_Register[x:0] - x depends on the size of address space (associated with 
ECC correction) 


Holds the (physical) address of the first ECC error since setting of any bit of ECC_Mode_Register[1:0]. This 
register is required only for ECC checkers for data on the main memory path in ICE9 (i.e., at the CSW interfaces 
to the L2 caches in the processor slices, and, optionally, at the Pci/Csw interface and at the Dma/Csw interface.) 


12.4.2.3 ECC_Error_Syndrom_Register[7:0] (associated with ECC correction) 


Holds the syndrome of the first ECC error since setting of any bit of ECC_Mode_Register[1:0]. This register is 
required only for ECC checkers for data on the main memory path in ICE9 (i.e., at the Csw interfaces to the L2 
caches in the processor slices, and, optionally, at the Pci/Csw interface and at the Dma/Csw interface.) 


Note: 


- Bits of ECC_Error_Status_Register & ECC_Error_Address_Register are set by the ECC logic during operation. 
Clearing of the register bits following an ECC event is up to software as a part of the interrupt routine triggered 
on an ECC event. 


- Separate ECC_Error_Status_Register, *_Address_Register and *_Syndrom_Register will be required for data 
coming out of the L2 cache and for data coming out of the CSW, to distinguish between ECC events in the 
L2 and events in the CSW/DDR memories. 


12.4.3 ECC Implementation & Test considerations 


In order to test the ECC logic during manufacturing chip test, we’ll need to ensure observability of the outputs 
of the ECC generation logic and controllability over the inputs to the ECC correction logic. If we don’t do anything 
special this is a problem because the whole point of ECC is to transparently correct errors without impacting 
normal operation. So, what we’re doing is the following: 


12.4.3.1 Compiled memories with Synchronous Write Through (SWT) mode 


When the Virage compiled memory supports SWT, we'll use it. With appropriate control settings, SWT 
provides a path for the write data coming into the memory to bypass the array and instead go to a flop, which 
is then driven (through a mux) to the output pins. The additional logic is incorporated in a wrapper around the 
memory array. The added flop is on a scan chain with control signals, scan-in and scan-out brought to pins of the 
wrapper. See Figure 12.5 

The BypassMUX and OutputMUX select signals must be set appropriately during test (by tying them to a 
decode of test_mode). Once that’s done, ECC generator outputs become observable via the scan flop and its 
scan chain. Controllability over the inputs to the ECC correction logic is accomplished via the same mechanism. 
Nothing special is required in the design of the logic around the RAM. 


12.4.3.2 Compiled memories with Asynchronous Write Through (AWT) and no Synchronous Write 
Through (SWT) 


For this case, there are 2 concerns: 1) observability and controllability for testing the ECC logic, and 2) ensuring 
that AWT does not introduce combinational loops. Since the compiled memory does not provide a convenient scan 
flop, we’ll need to provide one externally (“rammaker” will be modified to do this by default). We have a choice 
of putting the mux on the memory inputs or outputs; to be consistent with what’s provided for SWT-enabled 
RAMs, we’ll put it on the output, unless there’s a reason not to. If necessary, the flop and mux may be inserted 
upstream of the RAM on the data input side of the compiled memory & wrapper; see Vasu about a change 
to rammaker if you need to do this. (See Figure 12.6.) The OutputMUX select signal should be tied to a decode 
of test_mode, as should the BypassMUX select signal. In this case the OutputMUX and the flop must be explicitly 
incorporated into the design. Scan insertion of the flop will ensure that the necessary observability/controllability 
is achieved. (If the instance of the RAM requires immediate flopping of read data before ECC correction, there is 
no need to add anything special; observability & controllability are already available.) 


[Figure 12.5 shows the memory array inside its wrapper with the bypass mux (BMUX).] 


Figure 12.5: SWT ECC observability & controllability 

By default, the explicit flop & mux should be added to both data and ECC correction bits. If no combinational 
loops are introduced by AWT, the flop/mux may be added to only the ECC bits, thus saving on the flop count. 


[Figure 12.6 labels: WRAPPER, BMUX, CLK.] 


Figure 12.6: AWT ECC observability & controllability (also breaks any combinational loops) 


12.5 DFT and Test Support 


The ICE9 chip supports two different scan interfaces for test. 

The first is a serial “muxscan” interface used for chip test at wafer and die test stages. It provides up to 100 
parallel scan chains and test mode configuration pins. The scan modes are selected via the test mode input pins as 
shown in Table 12.4. The control pins relating to muxscan features are all prefixed with the name “test_”; any pin 
with the prefix “test_” is used in test modes only and can be tied off for normal operation. As per eSilicon’s practice, 
the test control pins are: test_scan_en (eSilicon’s name is scan_enable), test_mode_en (eSilicon: chip_test), 
and test_mode[2:0] (eSilicon: test_mode). When the ICE9 chip is installed on a module, test_scan_en and 
test_mode_en will be tied FALSE and the other three test_mode[2:0] pins will be ignored. In muxscan mode 
(“stuck-at scan” and “transition fault scan”), the DDR DQ & AD pins provide 88 bits of scan data output and 
scan data input between the two DDR controllers. The DDR DQ & AD pins also have test_sdi[*] & test_sdo[*] 
overrides. See Section 17.3 for a complete list of signal pins and test-mode overrides. The remaining 12 bits of 
scan in and scan out are provided on dedicated pins labeled test_sdi[99:88] and test_sdo[99:88]. Some of the entries 
in Table 12.4 seem to be duplicates with respect to PLL bypassing. In some cases, they are assigned different 
test_mode[3:0] entries due to different test-mode overrides. 

Anytime the PLL output is bypassed with a test_clk or sys_clk, that PLL should be held in reset by the LBS. 

The second test interface is the JTAG test scan chain used for boundary scan. This mode is implemented in 
an IEEE-JTAG 1149.1 Test Access Port (TAP) controller supplied by eSilicon. The JTAG chain has its own chip 
pins, prefixed with “jtag_” and only these signals carry the jtag_ prefix. 

The SysChain, described below, is used for in-system maintenance and initialization. The SysChain may also 
be used to set PLL controls and Virage RAM configuration parameters during manufacturing test. 

Because the specifics of the distribution of the clocks and reset signals are important to ATPG test generation, they are 
further described in Figure 12.10. 


12.5.1 Boundary scan (normal mode) 


For board-level continuity testing, the chip supports JTAG boundary scan. The PCIe PHY comes as a hard 
macro with boundary scan pre-inserted. The link PHYs do not support boundary scan. The DDR I/Os, LVDS 
clock I/Os and selected general-purpose I/Os will have boundary scan cells inserted by eSilicon along with the 
JTAG TAP controller insertion. The boundary scan-chain ordering follows the diagram below (JTAG TAP -> 
DDPo -> Pllsw -> DDPe -> Pllne -> PCIe PHY -> general purpose I/O block -> JTAG TAP): 


[Diagram: the boundary scan chain forms a ring around the die, from the JTAG TAP through the DDP and PLL blocks 
on the chip edges, the PCIe PHY (Pphy), and the StdI/O block, and back to the JTAG TAP.] 


12.5.2 Stuck-at Scan (test mode 16) 


eSilicon ATPG tests using mux-scan. Virage memories in SWT-mode (where supported) or AWT-mode. 


12.5.3 Transition Fault Scan (test_mode 17) 


Similar to stuck-at scan: eSilicon ATPG tests using mux-scan. Virage memories in SWT-mode (where supported). 
AWT-mode should not be used due to the multi-cycle paths created. 

12.5.4 PLL Test (test mode 18) 


A test of the 5 primary clock PLLs. With a 66.67 MHz differential sys_clk_o & sys_clk_e: 

- Look for the lock indication from each PLL, present at the *clkLock pins (active in test mode 18, see Section 
17.3) or in the PLL control register (Section 12.6.9). 

- Step through the entries in Table 12.13, using the ClkOutCtrl[*] pins (active in test mode 18, see Section 17.3) or 
the PLL control register, to bring out all possibilities listed. Depending on tester capabilities, check for presence of 
a toggling LVDS signal, check duty cycle, and check frequency. 


12.5.5 DDR ODT & Drive Strength Parametric Test (test mode 19) 


Similar to normal operation, boundary scan is used for parametric testing of the DDR PHY inputs & outputs 
with controllable drive strength (impedance) and controllable on-die termination (ODT). In this mode the 
test_sdi[99:88] & test_sdo[99:88] pins are used to control the drive impedance settings, the ODT termination 
settings, and the ODT (read) termination enable for both instances of Ddp. In addition, the results of the impedance 
calibration block for the 2 instances of Ddp are available. See Section 17.3 for detail on test mode 19 pin overrides. 
Because this test is performed with JTAG boundary scan functioning, the pins we override in this test mode must 
NOT be boundary scan inserted (or they may have observe-only boundary scan insertion). 

By making the results of the impedance calibration logic available at the chip pins, it is possible for the tester 
to check the impedance calibration using at least one and possibly several values of precision external resistor. 


12.5.6 Memory BIST and Repair (test mode 0, 20) 


Memory BIST is typically done in the normal operating mode; bypassing PLLs with test_mode 20 is available 
if needed. This path uses the JTAG 1149.1 TAP controller to access the Virage STAR Memory self test and repair 
features. Two test modes are provided, one with clocking from active PLLs, one with the active PLLs bypassed. 


12.5.7 DDR Functional Test (test modes 0, 21) 


It is expected that DDR functional tests will be done in normal operating mode. Test mode 21 is available if 
we want to bypass all PLLs for DDR functional testing. DDR functional tests probably require code running on an 
M5Kf core - specifics open here pending recommendation from the eSilicon DDR design team. We may need some 
pretty fancy load board design to support full-speed testing of the DDR I/Os. 


12.5.8 Slow DDR DLL Test (test mode 22) (whether all DLL tests will be used in 
mfg. test is still open) 


Note that both DLL test modes have a special set of pin overrides to allow the tester direct control over the 
DLLs. See Section 17.3 


12.5.8.1 DLL low speed test 1 (DLL vendor recommended) 


Control Slave Input, Observe Slave Output. 

1. set DLL_BYPASS_SLAV= DLL_FORCE_INPUT= 1. 

2. hold DLL in reset. 

3. set slave ADJ[] to max value. 

4. set TSTCTRL[2:0]=3, TSTCTRL[5:3]=3 (TSTCLK1= slave0_out; TSTCLK2= slave1_out). 

5. check that slave output is a buffered version of the slave input. This test can be performed by either applying 
an oscillating input and observing an oscillating output, or by setting the input to constant values and observing 
the same values at the output (in our case, this observability is accomplished through the DLL tstclk mux4 by 
selecting the slave outputs onto TSTCLK1/TSTCLK2 and verifying that the CLK_M90 is present. If we want a 
constant value on the slave0 input, this can only be accomplished by holding CLK_M90 either high or low, which 
would also appear to be ok since the DLL is held in reset). 


12.5.8.2 DLL low speed test 2 (DLL vendor recommended) 


Check Master Through TSTCLK Outputs. 

1. set DLL_BYPASS_SLAV= DLL_FORCE_INPUT= 1. 

2. hold DLL in reset. 

3. set MADJ[] to max value. 

4. set TSTCTRL[2:0]=0, TSTCTRL[5:3]=1 (TSTCLK1= ref_pd; TSTCLK2= fb_pd). 

5. check that TSTCLK1 & TSTCLK2 are buffered versions of RCLKI. 


12.5.9 Fast DDR DLL Test (test mode 23) (whether all DLL tests will be used in 
mfg. test is still open) 


Note that both DLL test modes have a special set of pin overrides to allow the tester direct control over the 
DLLs. See Section 17.3. 


12.5.9.1 DLL High Speed Test 1 


DLL vendor recommended test: 

1. set DLL_BYPASS_SLAV= DLL_FORCE_INPUT= 1. 

2. hold DLL in reset. 

3. set RCLKI to lowest operating frequency required. 

4. set MADJ[] to a nominal value. 

5. wait 1 µs for the analog control to reset. 

6. release reset, wait 500 RCLKI cycles for the DLL to lock. 

7. set TSTCTRL[2:0]=0, TSTCTRL[5:3]=1 (TSTCLK1= ref_pd; TSTCLK2= fb_pd). 

8. check that TSTCLK1 & TSTCLK2 have closely aligned rising and falling edges. 


12.5.9.2 DLL Functional Slave Test 


Recommended by eSilicon: 

1. set DLL_BYPASS_SLAV= DLL_FORCE_INPUT= 1 

2. hold DLL in reset. 

3. set RCLKI to 400MHz. 

4. set MADJ[7:0] to 184 (0xb8). 

5. wait 1 µs for the analog control to reset. 

6. release reset, wait 500 RCLKI cycles for the DLL to lock. 

7. set TSTCTRL[2:0]=3, TSTCTRL[5:3]=3 (TSTCLK1= slave0_out; TSTCLK2= slave1_out) 

8. set ASIC pins: DDR_DQSP[8:0]=400MHz, DDR_DQSN[8:0]=(~(400MHz)). (this is the input to slave1; 
slave0_input= CLKM90= RCLKI). 

9. set ADJ0[7:0]= 0; ADJ1[7:0]= 0; (slave0_delay= slave1_delay= 562 ps). 

10. check the phase relationship of TSTCLK1 & TSTCLK2 relative to RCLKI (i.e. the input clock in step ’3’ 
above from the tester). Save this to variables phase_tstclk1_0, phase_tstclk2_0. 

11. set ADJ0[7:0]= 92; ADJ1[7:0]= 92; (do not reset the DLL). (slave0_delay= slave1_delay= 1812 ps). 

12. check the phase relationship of TSTCLK1 & TSTCLK2 relative to RCLKI. Save this to variables phase_tstclk1_1, 
phase_tstclk2_1. 

13. compare the saved variables: 

result0= phase_tstclk1_1 - phase_tstclk1_0; 

result1= phase_tstclk2_1 - phase_tstclk2_0; 

14. pass/fail: result0 & result1 should both be approx. 1250ps. Note: this test is accomplished on the tester 
by running one continuous pattern, as follows: 

a. apply signals 

b. run loop and find measured values 0. 

c. break loop. 

d. change ADJ[] signals. 

e. run loop and find measured values 1. 

f. break loop. 
g. compare the measured values 0 and 1. 
h. pass fail the measured variables. 
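
The comparison in steps 13 and 14 can be sketched as below. This is only an illustration in Python, not tester code: the function names are made up here, and the 100 ps tolerance is a placeholder, since the procedure only says "approx. 1250ps". The expected shift is simply the difference of the two slave delays quoted in steps 9 and 11 (1812 ps - 562 ps).

```python
# Slave delays quoted in steps 9 and 11 of the procedure, in picoseconds.
DELAY_ADJ0_PS = 562.0
DELAY_ADJ92_PS = 1812.0
EXPECTED_SHIFT_PS = DELAY_ADJ92_PS - DELAY_ADJ0_PS  # 1250 ps

# Assumption: the procedure only says "approx."; 100 ps is a placeholder.
TOLERANCE_PS = 100.0


def delay_shift(phase_0_ps, phase_1_ps):
    """Step 13: result = phase measured at ADJ=92 minus phase measured at ADJ=0."""
    return phase_1_ps - phase_0_ps


def dll_slave_test_passes(tstclk1, tstclk2, tolerance_ps=TOLERANCE_PS):
    """Step 14: both shifts must be close to the expected 1250 ps.

    tstclk1 and tstclk2 are (phase_0, phase_1) pairs, each phase measured
    relative to RCLKI.
    """
    result0 = delay_shift(*tstclk1)
    result1 = delay_shift(*tstclk2)
    return all(abs(r - EXPECTED_SHIFT_PS) <= tolerance_ps for r in (result0, result1))
```

A real tester program would also have to handle phase measurements that wrap around the 2500 ps RCLKI period at 400MHz; that is left out of this sketch.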
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12.5.10 PCI Functional Tests (test modes 0, 24, 25, or 26) 


Loop-back / PRBS tests as described in the PCIe PHY documentation. These can be performed in the normal 
operating mode of the chip, with all PLLs for non-PCI clocks bypassed (and potentially inactive - test mode 24), 
with the AB pci_ref_clk PLL bypassed (test mode 25) and, optionally, the Synopsys PCIe PHY PLL bypassed as 
well (test mode 26). 


12.5.11 Fabric Transceiver Functional Test (test modes 27, 28) 


Testing and configuration of the Fabric Transceivers is via the ICE9 Serial Control Bus linkage on the SysChain. 
Most likely, this will be performed in a mode (27) which enables the fabric link PLLs in operational mode and drives 
all other clocks with the test clock input (the CCLK, pci_ref_clk, and DCLK PLLs are in bypass mode). This test can 
also be performed in the normal operating mode of the chip or with the sclk PLL bypassed. 

For a description of the path to load status into the link control registers see Section ??, and the link control 
register descriptions in Section 2.20. 


12.6 SysChain 


In operation, the ICE9 chip provides a system control scan chain interface (SysChain) to the Module Service 
Processor (MSP). The MSP uses this chain to load boot code into the ICE9 chip, enable and monitor clocks, assert 
and release internal reset signals and enable each of the chip’s subsystems. The SysChain is also used to read status 
from the chip and communicate with the processor core EJTAG interfaces. The MIPS EJTAG features are quite 
powerful and allow almost all of the operations normally obtained with an in-circuit emulator. See the MIPS 5Kf 
EJTAG specification for further information. 

The SysChain functions use the IEEE-JTAG 1149.1 protocol, but the SysChain is not a test feature. It is 
provided for maintenance and management of the ICE9 chip: JTAG just happens to be a handy protocol to 
provide this feature. All SysChain chip pins are prefixed with “sch_” and only those pins related to the SysChain 
carry the “sch_” prefix. 

Note that in order to be consistent between the various TAPs, the bit numbering convention for all SysChain 
TAP registers is MSB closest to TDI, while LSB is closest to TDO. 

The SysChain Test Access Port (TAP) consists of eight JTAG controllers wired in series, as shown in Figure 
12.7. The first (nearest TDI) is the PCI-Express TAP controller, which has an 8 bit wide Instruction Register 
(IR). Next is the SysChain TAP controller, which has a 5 bit wide IR. The remaining six controllers are the 
MIPS EJTAG TAPs, each of which has a 5 bit wide IR. This presents a composite SysChain TAP IR width of 
8+5+(6x5)=43 bits. To complicate matters further, on the ICE9 module the E-Silicon JTAG chain is also wired 
in series in front of the SysChain; see Section 12.6.15 and Figure 12.8. Therefore the total length of the SysTap 
IR is: 


SysTap IR Length: 18+8+5+(6x5)=61 bits. 


Note that for all descriptions that follow, the COMPLETE JTAG chain is accounted for. Thus IR length of the 
System TAP chain includes both the externally (module) wired JTAG as well as the SysChain JTAG. 
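
The arithmetic above can be checked with a short sketch. It is illustrative Python, not part of the chip deliverables; the helper names are invented here. The TAP order is taken from the enumeration in Section 12.6.4, the bit numbering follows the MSB-closest-to-TDI convention stated above, and BYPASS is the all-ones instruction as required by IEEE 1149.1.

```python
# IR widths in chain order, TDI first. Widths come from the text above;
# the CPU ordering is taken from the SysChainTaps enumeration (12.6.4).
TAP_IR_WIDTHS = [
    ("ESI", 18),  # E-Silicon TAP, wired in front of the SysChain on the module
    ("PCI", 8),
    ("SCH", 5),
    ("CPU2", 5), ("CPU0", 5), ("CPU1", 5),
    ("CPU3", 5), ("CPU5", 5), ("CPU4", 5),
]


def ir_length(taps):
    """Total number of IR bits in the given list of TAPs."""
    return sum(width for _, width in taps)


def ir_fields(taps):
    """Map each TAP name to its (msb, lsb) slice, MSB closest to TDI."""
    fields = {}
    msb = ir_length(taps) - 1
    for name, width in taps:
        fields[name] = (msb, msb - width + 1)
        msb -= width
    return fields


def compose_ir(taps, selected):
    """Build the concatenated IR value.

    `selected` maps TAP names to instruction codes; every TAP not named
    there is given BYPASS (all ones).
    """
    value = 0
    for name, width in taps:
        mask = (1 << width) - 1
        value = (value << width) | (selected.get(name, mask) & mask)
    return value
```

For example, `compose_ir(TAP_IR_WIDTHS, {"SCH": 0x01})` places the SysChain TAP instruction 5’h01 in bits 34:30 and BYPASS everywhere else.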

Each TAP controller’s IR selects which User-defined Data Register (UDR) is connected between that TAP’s 
Test Data Input (TDI) and Test Data Output (TDO) signals. All IR selectable UDRs are documented in section 
12.6.5. Note that the relative position of each UDR stays the same, that is, first the selected E-Silicon UDR, 
followed by the PCI-Express UDR, followed by the ICE9 SysChain UDR, then the six MIPS EJTAG UDRs. Also 
note that the width each UDR occupies in the chain varies with the UDR selected. 

Typically, SysChain accesses will be confined to a UDR in one TAP controller. The MSP will select which TAP 
and UDR it wishes to access during the initial IR scan, placing the other TAPs into the JTAG BYPASS mode. 
When a UDR is being sampled, it is up to software running on the MSP to ensure that the proper data values 
are shifted into this UDR during JTAG Capture-Shift-Update-DR operations to prevent signals from inadvertently 
changing. 

By wiring the TAP controllers in series there is a small amount of overhead introduced when shifting a particular 
UDR. Again, referring to Figure 12.7, notice that any E-Silicon UDR has eight downstream TAP controllers that 
in the best case are in bypass mode. This introduces eight bits of prefix data to any E-Silicon UDR being shifted 
out. For any PCI-Express UDR the overhead is one bit less and for any SysChain UDR the overhead is two bits 
less, since there are only the six MIPS cores downstream of it. When accessing a MIPS Core UDR, the number of 
overhead bits will vary depending upon which core is being accessed, see Figure 12.7. When shifting data into a 
UDR the situation is reversed. In either case, the MSP must remember what UDRs have been configured on the 
chain in order to know their relative positioning. 
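
A small sketch of the bypass bookkeeping described above, assuming the best case where every TAP other than the target is in BYPASS (one bit each). The names are hypothetical and the chain order is the one given in Section 12.6.4; this is not MSP firmware.

```python
# Chain order, TDI first, as listed in the SysChainTaps enumeration.
CHAIN_ORDER = ["ESI", "PCI", "SCH", "CPU2", "CPU0", "CPU1", "CPU3", "CPU5", "CPU4"]


def bypass_counts(target, chain=CHAIN_ORDER):
    """Return (downstream, upstream) counts of bypassed TAPs around `target`.

    `downstream` is the number of one-bit BYPASS registers between the
    target UDR and TDO, i.e. the prefix bits seen before the UDR data
    when shifting out. `upstream` is the number between TDI and the target.
    """
    index = chain.index(target)
    return len(chain) - index - 1, index


def dr_scan_length(target, udr_width, chain=CHAIN_ORDER):
    """Total DR shift length: the selected UDR plus one bit per bypassed TAP."""
    downstream, upstream = bypass_counts(target, chain)
    return udr_width + downstream + upstream
```

With these numbers, a 64-bit SysChain UDR costs 72 shift clocks in total: six bypass bits downstream and two upstream.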


12.6.1 SysChain Ordering Rules 


A write to a SysChain register may not take effect immediately; there may be downstream logic that requires 
extra SysChain clocks for the write to complete. If software requires a write to have been completed before doing 
something else, it must follow the normal system ordering rule, namely read the register back. This read will ensure 
the write has been completed. 
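
The ordering rule amounts to a write followed by a read of the same register. The sketch below shows the pattern against a stand-in object; `FakeSysChain` and `write_ordered` are hypothetical names used only for illustration, and a real driver would issue JTAG DR scans instead of touching a dictionary.

```python
class FakeSysChain:
    """Stand-in for a SysChain driver (hypothetical, for illustration only)."""

    def __init__(self):
        self._regs = {}
        self.log = []

    def write(self, name, value):
        self.log.append(("write", name))
        self._regs[name] = value

    def read(self, name):
        self.log.append(("read", name))
        return self._regs.get(name, 0)


def write_ordered(chain, name, value):
    """Write a register, then read it back so the write is known to be complete."""
    chain.write(name, value)
    return chain.read(name)
```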


12.6.2 Vregs Package 
Package 


chip_lbs_spec 


Attributes 


-public_rdwr_accessors 


12.6.3 SysChain TAP Constants 
Defines 


SYSTAP 


32’d61 | IR_LENGTH | System TAP instruction length 
32’d43 | SCH_IR_LENGTH | System Chain instruction length 
32’d18 | JTAG_IR_LENGTH | ESI JTAG TAP controller’s instruction length 
32’d8 | PCI_TAP_IR_LENGTH | PCIe TAP controller’s instruction length 
32’d5 | SCH_TAP_IR_LENGTH | SCH TAP controller’s instruction length 


12.6.4 SysChain TAP Enumeration 


This enumeration allows code to select which TAP is to be operated upon. Software should assume the TAPs 
are laid out in the order specified by this enum; see R_SysTapInstrReg for that information as well. 


Enum 


SysChainTaps 


Product 
E-Silicon TAP 
PCI-Express TAP 
SysChain TAP 
CPU2 CPU 2 EJTAG TAP 
CPU0 CPU 0 EJTAG TAP 
CPU1 CPU 1 EJTAG TAP 
CPU3 CPU 3 EJTAG TAP 
CPU5 CPU 5 EJTAG TAP 
5’h8 CPU4 CPU 4 EJTAG TAP 


Enum 


SysChainTapsTwc 
note: TWC9 order TBD 


12.6.5 System TAP Instructions 
Description 


The System TAP instruction enumerations can be loaded into their respective JTAG TAP Controller IRs to 
select any one of the UDRs listed. Each UDR is documented further in the sections that follow. There is one set of 
enumerations per TAP Controller. Only the ICE9’s SysChain TAP enumerations are fully described in this spec. 
The remaining TAPs are fully documented in their respective specifications.² 

The SysTapEsiInstr enumeration below is a special case. The E-Silicon TAP IR is only 18 bits wide, but the 
enumerations are specified as 26 bits wide to accommodate unique enumerations for DRs of different sizes using a 
single IR encoding. This is needed because the E-Silicon TAP supports an IEEE P1500 TAP controller as one of 
the devices that can be connected to its scan chain. The P1500 can connect DRs of different sizes depending upon 
what was written to the JPC or SMS IR, even though in each case the E-Silicon IR TAP encoding is the same. 
Thus the P1500 breaks the typical one-to-one correlation between the E-Silicon TAP IR selected and the associated 
DR length. In order to avoid maintaining state information in software to deal with the P1500, the enumerations 
in this table were widened to allow software to specify directly which JPC or SMS WDR is being 
selected during the current scan operation. Note that in every case the least significant 18 bits of the encodings are 
identical. This is what is shifted into the E-Silicon TAP IR. The remaining 8 bits are not scanned into the TAP, 
but are used by software to indicate the length of the associated DR register. 
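
As a rough illustration of how software might separate the two parts of a widened enumeration value, consider the sketch below. The function name is invented; the example constant 26’h03_3FF7A is one of the SELECT_SMS_WDR_* values listed in the enumeration.

```python
ESI_IR_WIDTH = 18    # bits actually shifted into the E-Silicon TAP IR
ESI_ENUM_WIDTH = 26  # width of the widened enumeration value


def split_esi_instr(enum_value):
    """Split a 26-bit enumeration into (ir_bits, software_context).

    ir_bits is the low 18 bits that get scanned into the TAP IR;
    software_context is the upper 8 bits that never reach the hardware.
    """
    if not 0 <= enum_value < (1 << ESI_ENUM_WIDTH):
        raise ValueError("enumeration value does not fit in 26 bits")
    ir_bits = enum_value & ((1 << ESI_IR_WIDTH) - 1)
    software_context = enum_value >> ESI_IR_WIDTH
    return ir_bits, software_context
```

Note that 18 is not a multiple of 4, so the software-only field does not line up with the hex digit grouping used in the table: 26’h03_3FF7A splits into IR bits 0x3FF7A and an upper field of 0x0C, not 0x03.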

For the ICE9, there are important deviations from the JTAG Standard within the E-Silicon TAP. The E-Silicon 
TAP uses an inverted TCK internally. When connected to JTAG scan chains that do not use the inverted TCK, 
this has the side-effect of inducing one extra clock of delay to the shift chain across the E-Silicon TAP. Therefore 
shifting data into or out of scan registers within the E-Silicon TAP requires one extra TCK to be inserted ahead of the 
shift. In the special case of reading the SMS 512K Test Algo. or Status Registers, the E-Silicon TAP requires two 
extra TCKs to be inserted prior to shifting data out of the register. 

In addition, all JPC and SMS WDR registers shift in a direction opposite of the normal IEEE JTAG standard, 
having their MSB connected to TDO and LSB connected to TDI instead of the other way round. Thus the contents 
of these registers may need to be bit-swapped, depending upon how a given JTAG bus master shifts its scan chain. 
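
The two E-Silicon TAP quirks above (extra TCKs and reversed shift direction) could be handled by helpers along these lines. This is a sketch with invented names; whether a swap is needed at all depends on how the JTAG bus master orders its shift data.

```python
def reverse_bits(value, width):
    """Return `value` with its low `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def esi_extra_tcks(reading_sms_512k=False):
    """Extra TCKs to insert ahead of a shift through the E-Silicon TAP.

    One extra clock in general; two when reading the SMS 512K Test Algo.
    or Status Registers, as stated in the text.
    """
    return 2 if reading_sms_512k else 1
```

Reversing twice with the same width returns the original value, so the same helper serves for both writes and reads of a JPC or SMS WDR.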


Enum 


SysTapEsiInstr 


Constant | Mnemonic | TapSize | Definition 
26’h00_3FFFE | IDECODE | | Device Identification Register* 
26’h00_3FFFF | BYPASS | | 
 | SELECT_JPC_WIR | | Select JPC WIR 
26’h00_3FD7A | SELECT_JPC_WDR | | Select JPC WDR 
 | SELECT_JPC_WDR_SMSNUM | | Select JPC SMS Num Register 
26’h02_3FD7A | SELECT_JPC_WDR_BYPASS | | Select JPC Bypass Register 
26’h00_3FE7A | SELECT_SMS_WIR | | Select SMS WIR 
26’h00_3FF7A | SELECT_SMS_WDR | | 
26’h01_3FF7A | SELECT_SMS_WDR_TBX32K | 234 | 
26’h02_3FF7A | SELECT_SMS_WDR_TBX512K2P | 556 | 
26’h03_3FF7A | SELECT_SMS_WDR_TBX512K1P | 308 | 
26’h04_3FF7A | SELECT_SMS_WDR_STS32K | | 
26’h05_3FF7A | SELECT_SMS_WDR_STS512K | | 
26’h06_3FF7A | SELECT_SMS_WDR_BYPASS | | 
26’h00_3FFE8 | EXTEST | | 
26’h00_3FFF8 | SAMPLE | | 
26’h00_3FFF8 | PRELOAD | | 
26’h00_3FFCF | HIGHZ | | 
26’h00_3FFEF | CLAMP | | 


<TBD - Add the remaining E-Silicon TAP Instructions> 
* = Test-Logic-Reset Default 


Enum 


JpcSms 


Attributes 


-descfunc 


pont [BBS |_| chip bbs 
pena [CACO [ | chip psn 
pens CAC [| chip psi.cac 
pend CAC2 |__| chip ps2cac 
pons CAC |__| chip. ps8.cac 
pene CAGE [| ehip.pseac 
pen? GAGS |__| chip. psb.cac 
pens COMO [| ehip.coho 
peng COR [| chip.cohe 
sha CPUD [|__| chip pst. cpa 
Penh Pur [|__| ehip.psi-cpnmkt 
Pehe_[ OPU2___ |__| chip. ps2. cpm 
peha CPUs [| chip. ps8.cpn mk 
Pehe [| CPU |__| ehip.psepn mkt 
poht [Pus [| chip. s5.cpn mkt 
PehIy_[DDRE_ |__| chip.dtreddi___ 
Penn [| DDRO |__| ehip.daho.dai 
Pehla [DMA [| chip.dma 
penis [FSW ‘| chip. 


Enum 


SysTapPciInstr 


(Tapsiae (Updater) 


8’h01 | IDECODE | Device Identification Register* 
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[atid | USERCODE | Veer Cede Ragga 
[—ansr[orset | Control Regier ——SC—CidwSSC*dSCCY 


[snap_[exrest [Ewer SSCS CdSCOCC* 
[sic [EXTESTTRAIN | Beier waiaing ———SSSCSCSC*dBD) PCS 
[snGd_[EXTESTPULSE | Bxtest pulse ——SSCSCSC—~dIPTIBD) YP CCdSCOCC*d 
[snr [PRELOAD [Pred SSCS) PCS 
[—anri_[ SAMPLE [Sample __—————iaerep) | si sd 
[sure [Bypass | Bypas Gimme daveb) [2 +P SS Sd 


* = Test-Logic-Reset Default 


Enum 


SysTapSchInstr 


[soo [avrasso [| id vaso OOC—CSCSCS—S. Cid Cid 
[shor [DECODE |__| RSysTupDecode | Deviee Kentfeation Regte™ ———SS—=dt SCC 
a 
[soe [RESET [| RSyeTaphest [Reset Control Register| 
soa [ OPUDINT_[ TOBOA[RSysTupbine | CPU Debug Intorsupt Control Regier si Yd 
(shop [SMSBIsT |__| R.SysTupSmaBist_| SMS RAM BIST Control Regier [us| __¥ | 
[soe [scp | “Tver: | Berl Cotton Bas interns Rages a | 
[sop [ATMs [| RSysTapatoNp [attention MSP Reiter 
[shor | MENINEE | TWCOA | RSysTupMeminit | Memory Zero Register ————S—<dt 2 Si Yd 
[sno | SOBG4 [TWA RSysTupsebor | Serial Configuration Bus GED accom Regier [10a YY 
Sd 5 a CN 


* = Test-Logic-Reset Default 


<|<|<|<|<|<|<]<]< 


Enum 


SysTapCpuInstr 


Constant | Mnemonic | TapSize | Definition 
5’h01 | IDECODE | | Device Identification Register* 
5’h03 | IMPCODE | | Implementation Register 
5’h08 | ADDRESS | | Address Register 
5’h09 | DATA | | Data Register 
5’h0A | CONTROL | | EJTAG Control Register 
5’h0B | ALL | 132 | Address, Data and EJTAG Control Registers 
5’h0C | EJTAGBOOT | | Forces Debug Exception after Reset 
5’h0D | NORMBOOT | | Execute reset handler after Reset 
5’h1F | BYPASS | | Bypass 


* = Test-Logic-Reset Default 


12.6.6 System TAP Instruction Register 
Description 
The System Test Access Port Instruction Register consists of all the JTAG TAP IRs concatenated together. 
This is used only in ICE9; for TWC9 see R_SysTapInst. 
Class 
R_SysTapInstrReg 


Attributes 


-tapSize=61 


Bits | Field | Access | Reset | Description 
60:43 | Esi | W | | E-Silicon TAP Instruction Register 
42:35 | Pci | W | | PCI Express TAP Instruction Register 
34:30 | Sch | W | 1 | System Chain TAP Instruction Register 
29:25 | Cpu2 | W | | CPU 2 TAP Instruction Register 
24:20 | Cpu0 | W | | CPU 0 TAP Instruction Register 
19:15 | Cpu1 | W | | CPU 1 TAP Instruction Register 
14:10 | Cpu3 | W | | CPU 3 TAP Instruction Register 
9:5 | Cpu5 | W | | CPU 5 TAP Instruction Register 
4:0 | Cpu4 | W | | CPU 4 TAP Instruction Register 


12.6.7 System TAP Instruction Register for TWC9 


Description 


The System Test Access Port Instruction Register consists of all the JTAG TAP IRs concatenated together. 
This is used only in TWC9; for ICE9 see R_SysTapInst. 


Class 
R_SysTapInstrTwc 


Attributes 


-tapSize=81 


Cpul 


35 [Cpul__‘[ Wt 
30/ Opus [Ww [1 _| 
5 


TWC9A+ | CPU 8 TAP Instruction Register 


12.6.8 Device Identification Register 


Description 


The Device Identification (IDECODE) Register contains the ICE9 and SiCortex device specific information in 
the IEEE 1149.1 JTAG Standard format. 


Class 
R_SysTapIDecode 


Attributes 


-tapSize=32 


31:28 | Version | R | pins | SiCortex part version for the ICE9 device. Returns 1 for ICE9A0/ICE9B0, 
2 for ICE9A1/B1, etc. 
27:12 | AddrProduct | SiCortex part number for the ICE9 device. Always ICE9. 
SICORTEX | AddrTapMfer | JEDEC-derived IEEE 1149.1 manufacturer identifier for SiCortex. 
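
For reference, a device identification word in the IEEE 1149.1 format can be unpacked as sketched below. The 31:28 and 27:12 fields match the table above; the 11-bit manufacturer field in bits 11:1 and the constant 1 in bit 0 come from the IEEE 1149.1 layout rather than from this table. The function name and the example value used to exercise it are made up.

```python
def decode_idecode(value):
    """Unpack a 32-bit IEEE 1149.1 device identification word."""
    return {
        "version": (value >> 28) & 0xF,          # bits 31:28
        "part_number": (value >> 12) & 0xFFFF,   # bits 27:12
        "manufacturer": (value >> 1) & 0x7FF,    # bits 11:1
        "marker": value & 0x1,                   # bit 0, always 1 per 1149.1
    }
```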


12.6.9 PLL Control Register 
Description 


The PLL Control Register chain has one control and status register for each PLL on the ICE9. The registers 
control the input signals described in Tables 12.2 and 12.3. The PLL Control Register chain also has a 3-bit register 
for each of the 2 PLL groups (Pllsw & Pllne) that makes one of the clocks in the group observable through pins on 
the chip. The order of the bits in the scan chain across the five PLLs and the two clock output control registers 
is shown in the attribute table below. The reset values should be such that the PLLs run at their nominal system 
speeds, to minimize the complexity of the ATE initialization sequence. 


Class 
R_SysTapPll 


Attributes 


-tapSize=64 


Bits | Field | Access | Reset | Description 
 | | | | Reserved 
 | | R | 0 | PMI Clock PLL Lock (1=locked, 0=unlocked). 
60:58 | Pllsw | RW | 0 | Clock output control register (see Pllsw description below). 
57:55 | Pllne | RW | 0 | Clock output control register (see Pllne description below). 
54 | D1clkReset | RW | 0 | DDR1 Controller Clock PLL Reset. 
53:49 | D1clkDivF | RW | 23 | DDR1 Controller Clock PLL Divisor Factor. 
48:47 | D1clkOutSel | RW | 1 | DDR1 Controller Clock PLL Output Select. 
46 | D1clkBypClkSel | RW | 0 | DDR1 Controller Clock PLL Bypass Clock Select. 
45 | D1clkBypEnb | RW | 0 | DDR1 Controller Clock PLL Bypass Enable. 
44 | D1clkLock | R | 0 | DDR1 Controller Clock PLL Lock (1=locked, 0=unlocked). 
43 | D0clkReset | RW | 0 | DDR0 Controller Clock PLL Reset. 
42:38 | D0clkDivF | RW | 23 | DDR0 Controller Clock PLL Divisor Factor. 
37:36 | D0clkOutSel | RW | 1 | DDR0 Controller Clock PLL Output Select. 
35 | D0clkBypClkSel | RW | 0 | DDR0 Controller Clock PLL Bypass Clock Select. 
34 | D0clkBypEnb | RW | 0 | DDR0 Controller Clock PLL Bypass Enable. 
33 | D0clkLock | R | 0 | DDR0 Controller Clock PLL Lock (1=locked, 0=unlocked). 
32 | PciRefReset | RW | 0 | PCI Reference Clock PLL Reset. 
24 | PciRefBypDiv2Enb | RW | 0 | PCI Reference Clock Bypass Divide by 2 Enable. 
23 | PciRefBypEnb | RW | 0 | PCI Reference Clock Bypass Enable. 
22 | PciRefLock | R | 0 | PCI Reference Clock PLL Lock (1=locked, 0=unlocked). 
21 | SclkReset | RW | 0 | Switch Fabric SERDES Clock PLL Reset. 
13 | SclkBypDiv2Enb | RW | 0 | Fabric Switch and Links Clock PLL Bypass Divide by 2 Enable. 
12 | SclkBypEnb | RW | 0 | Fabric Switch and Links Clock PLL Bypass Enable. 
11 | SclkLock | R | 0 | Fabric Switch and Links Clock PLL Lock (1=locked, 0=unlocked). 
10 | PclkReset | RW | 0 | Processor Clock PLL Reset. 
2 | PclkBypDiv2Enb | RW | 0 | Processor Clock PLL Bypass Divide by 2 Enable. 
1 | PclkBypEnb | RW | 0 | Processor Clock PLL Bypass Enable. 
0 | PclkLock | R | 0 | Processor Clock PLL Lock (1=locked, 0=unlocked). 


The PLL control register chain also has a 3-bit register for each of the 2 PLL groups (Pllsw & Pllne) that makes 
one of the clocks in the group observable through pins (test_clk_o_h/l for Pllsw or pci_ref_clk_h/l for Pllne). See 
Table 12.13 below for a complete description of which clocks are made observable for each setting of these two 
registers. 

For both Pllsw and Pllne, the reset default value causes the differential outputs to be tri-stated. With an 
operating PCIe interface, an ICE9 would need to have the Pllne register set to select pci_ref_clk. For the ddr-clock 
PLLs, we also allow for driving out the XOR of the in-phase and 90 degree phase shifted PLL outputs. This allows 
for a crude measure of phase alignment of the 2 clocks; if they’re exactly 90 degrees out of phase, the XOR signal 
will have a 50% duty cycle. Since we won’t use a real analog mixer for the XOR, the resulting signal will be only 
a rough approximation to the ideal. 
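
The duty cycle claim can be illustrated with an idealized model. For two ideal 50% duty cycle square waves of the same frequency, offset by a phase of φ degrees (0 to 180), the XOR is high for a fraction φ/180 of the period, so 90 degrees gives 50%. The sketch below checks this by sampling one period; it is a toy model with invented names and says nothing about the real LVDS output.

```python
SAMPLES_PER_PERIOD = 3600  # 0.1 degree resolution


def xor_duty_cycle(phase_deg):
    """Duty cycle of the XOR of two ideal 50% square waves offset by phase_deg."""
    shift = round(phase_deg * SAMPLES_PER_PERIOD / 360) % SAMPLES_PER_PERIOD
    half = SAMPLES_PER_PERIOD // 2
    high = 0
    for i in range(SAMPLES_PER_PERIOD):
        in_phase = i < half                                   # reference clock
        shifted = ((i - shift) % SAMPLES_PER_PERIOD) < half   # delayed clock
        high += in_phase != shifted
    return high / SAMPLES_PER_PERIOD
```

A measured duty cycle above or below 50% then indicates that the two outputs are more or less than 90 degrees apart.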

In all cases, what’s driven to the output mux & LVDS output cell is taken from very close to the PLL output, 
i.e., near the root of the clock tree, not tapped off the end of the clock tree. The provided functionality is for testing 
PLL operation, not the clock distribution network. 


Bit Field Read/Write | Value after Reset Description (Pllsw) Description (Pllne) 
<a> [0 [RW | 0] select no output (Hiz) | select no output (HZ) 


Table 12.13: Clock Output Control Register (2 copies) 


12.6.10 Reset Control Register 
Description 


The Reset Control Register allows the MSP to assert resets and enables on a unit by unit basis. All reset signals 
are SET after a hardware reset. All enables are CLEAR after a hardware reset. All reset and enable bits are 
directly read upon a SysChain Capture-DR operation and directly written on an Update-DR. 

There are two types of reset implemented by the Reset Control Register: Unit resets and Virage STAR Memory 
System (SMS) resets. The Unit resets are used to initialize specific functional units within the ICE9. The SMS 
resets are used to reset the Built In Self Test (BIST) status for all the SMS RAMs in the ICE9. 

At power-on both Unit and SMS resets are asserted. The MSP will bring the ICE9 out of reset by first de- 
asserting the SMS resets so that BIST can be performed on all RAMs that support it while keeping the Unit resets 
asserted. In the ICE9, BIST is used not only for testing RAM but also to initialize some RAMs into a useful 
state for system bring-up. During BIST, it is necessary that each functional unit that contains SMS RAM be held 
in reset, to prevent improper operations from being induced by the BIST activities. Once BIST has successfully 
completed, the MSP will bring the functional units out of reset by de-asserting the appropriate Unit reset bits as 
part of system bring-up. 


Restrictions 


Whenever the MSP is changing more than one of these bits in a single Update-DR operation, it must not set 
bits while clearing others. All multi-bit operations must be isotonic (all set or all clear). This restriction avoids 
race hazards in downstream logic that may use combinatorial expressions made from more than one of these bits. 
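
A check for the isotonic restriction might look like the sketch below. The function names are hypothetical. The second helper splits a mixed update into two isotonic writes; doing the clears first is an arbitrary choice made for this sketch, and the right order for a given bring-up sequence depends on which resets and enables are involved.

```python
def is_isotonic_update(old, new):
    """True if going from `old` to `new` only sets bits or only clears bits."""
    setting = new & ~old
    clearing = old & ~new
    return not (setting and clearing)


def isotonic_steps(old, new):
    """Return the register values to write, in order, to reach `new` from `old`.

    Assumption: when both directions are needed, clear first, then set.
    """
    if is_isotonic_update(old, new):
        return [new]
    return [old & new, new]
```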


Class 
R_SysTapReset 


Attributes 


-tapSize=64 


CC 
LAC reset. Prior to TWC9, this was ganged into the Scbm reset. 
1 PMI reset. Prior to TWC9, this was ganged into the Scbm reset. 
RW | 0x3F | Reset for Processor 9:6 SMS (Pclk). See _ProcSms6. 


Ghee) Reserved. FIX; spread ProcSms6 and Proc6 to allow 
Peas ol |__| room for CPULOS 

iw 

RW 
RW 
RW 
RW 


0x3F | Reset for Processor 9:6. See Proc. 
SMS Clock Enable 


Reset for DDR0 controller SMS (D0clk). 


Reset for DDR1 controller SMS (D1clk). 


Reset for COH and DDI even SMS (Cclk). 
Reset for COH and DDI odd SMS (Cclk). 
Reset for Fabric Switch SMS (Sclk). 


a 


a 


 | CswOclaSms | RW | 1 | Reset for Central Switch OCLA SMS (Cclk). 
30 | ScbmSms | RW | 1 | Reset for SCBM SMS (Cclk). 
29 | BbsSms | RW | 1 | Reset for BBS SMS (Cclk). 
 | | RW | 1 | Reset for PCI SMS (Iclk). 


27:22 | ProcSms | RW | 0x3F | Reset for Processor 5:0 SMS (Pclk) [six resets, one per SMS]. See 
also _ProcSms6. 


RW | Reset for DDR controller. 
RW | Reset for COH and DDI even. 
RW | Reset for COH and DDI odd. 
RW | Reset for fabric switch. 
RW | Reset for fabric links. 
RW | 1 | Reset for SCB. Prior to TWC9A, this also reset the BBS including 
the OCLA. 
RW | 1 | Reset BBS. 


5:0 | Proc | Reset for Processor 5:0 [six resets, one per processor]. See also 
_Proc6. This will reset all processor registers, excluding the 
R_IcetxTime register. 


12.6.11 Memory Init Register 
Description 
The Memory Init Register allows the MSP to initialize on-chip memories on a unit by unit basis. Memories are 
NOT reset by default and the MSP must use this register to ensure proper memory state. 
Class 
R_SysTapMemInit 


Attributes 


-tapSize=32 


pare i ( TWODAT | Reserved. (For extending OPUs to IBI)———*d 
[25:6 [Opa RW | TWOOAT 
i aS! ea 


Init PMI 


Init Fabric Switch (Sclk). 
Init DMA Engine (Cclk). 
Init Central Switch OCLA (Cclk). 
Init Level 2 Cache (Cclk). 
Init SCBM (Cclk). 
Init BBS (Cclk). 
Init PCI (Iclk). 


TWC9A+ | Init busy. To initialize a memory, software writes the 
appropriate bits to one. This bit will then remain cleared 
until all RAMs are finished, at which point it will read 
as a one. Software must then zero this register. Once the 
register is zero, the MSP has the option of initializing 
other memories. 


12.6.12 Processor Debug Interrupt Register 
Description 


The Processor Debug Interrupt Control Register allows the MSP to send a Debug Interrupt (DINT) request to 
one or more MIPS cores in the ICE9. The MIPS EJTAG Specification specifies that a debug interrupt is requested 
when the DINT signal transitions from low to high.³ The associated MIPS core is allowed to synchronize this signal 
to its own clock before detecting its rising edge. Section 8.2.2 of the specification also states that the DINT high 
and low times must each be at least 1 µs in order to leave enough time for the CPU core to synchronize the 
DINT signal to its internal clock domains. The DINT signal rise/fall times are also specified as a maximum of 
3 ns. The MSP and associated logic should observe these restrictions for bits in this register. 

This register only exists in ICE9A. In ICE9B it was moved to R_ScbDInt. 


Class 
R_SysTapDint 


Attributes 


-tapSize=8 


Bits | Field | Access | Reset | Product | Description 
 | | RW | 0 | ICE9A | Enable any processor or OCLA to send a debug interrupt to all processors. 
5 | Dint5 | RW | 0 | ICE9A | Processor Core 5 Debug Interrupt (on transition from 0 to 1). 
4 | Dint4 | RW | 0 | ICE9A | Processor Core 4 Debug Interrupt (on transition from 0 to 1). 
3 | Dint3 | RW | 0 | ICE9A | Processor Core 3 Debug Interrupt (on transition from 0 to 1). 
2 | Dint2 | RW | 0 | ICE9A | Processor Core 2 Debug Interrupt (on transition from 0 to 1). 
1 | Dint1 | RW | 0 | ICE9A | Processor Core 1 Debug Interrupt (on transition from 0 to 1). 
0 | Dint0 | RW | 0 | ICE9A | Processor Core 0 Debug Interrupt (on transition from 0 to 1). 


12.6.13 SMS BIST Control Register 
Description 


The SMS BIST Control Register allows the MSP to initiate BIST on all of the Virage SMS RAMs inside the 
ICE9. To ensure proper operation, BIST should only be initiated after every SMS reset has been de-asserted in the 
Reset Control Register. SMS BIST performs RAM tests, loads the memory fuse map and performs initialization 
on those RAMs that require specific data initialization prior to normal operation. This is important for Tag arrays 
and some other memory structures that, because of the BIST requirement, can’t be initialized under reset. BIST 
is activated via the Virage SMART signals, which are entirely separate from the P1500 port connected to the test 
JTAG chains. The attribute table below shows the format of the register. 

For chips installed in systems, all Virage BIST operations are completed while unit resets are asserted, see 
Section 12.6.10. This includes the INITIALIZE operation. The proper behavior for all components on the chip that have 
RAM arrays is to clear all address registers to 0 while RESET is asserted and the RAM is not in INITIALIZE 
mode. While RESET is asserted and the RAM is in INITIALIZE mode, the hardware should clear all locations to 
a known and repeatable state. INITIALIZE and BIST commands should be ignored when RESET is not asserted. 

The MSP prepares for Virage BIST by first clearing all of the SMS Resets in the Reset Control Register, making 
sure that the Unit resets remain asserted to prevent unpredictable hardware operations while BIST is running. The 
MSP then enables Virage BIST by asserting both SmartEnb and SmartRun bits in R_SysTapSmsBist and then 
de-asserting SmartEnb. BIST is complete for all SMS RAMs when the SmartDone bit is asserted. The MSP must 
poll this bit to determine when BIST has completed. After BIST completion, the MSP can examine the SmartFail 
bit to determine if BIST passed or failed. The MSP should be aware that one of the SMART failure modes is 
the inability to complete, and should time out if SmartDone has not asserted after a suitable polling period has 
elapsed. 
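
The enable, poll and timeout sequence just described can be sketched as follows. Everything here is illustrative: `run_smart_bist`, `BistTimeout` and the `FakeSmsBist` stand-in are invented names, fields are addressed by name rather than by bit position, and the poll limit is a parameter because the text does not give a polling period. The sketch assumes the SMS resets have already been cleared and the Unit resets are still asserted.

```python
class BistTimeout(Exception):
    """Raised when SmartDone never asserts within the polling budget."""


def run_smart_bist(reg, max_polls):
    """Start SMART BIST and return True on pass, False on fail."""
    reg.write(SmartEnb=1, SmartRun=1)   # assert both bits
    reg.write(SmartEnb=0)               # then de-assert SmartEnb
    for _ in range(max_polls):
        status = reg.read()
        if status["SmartDone"]:
            return not status["SmartFail"]   # SmartFail is only valid once done
    raise BistTimeout("SmartDone did not assert; treat as a BIST failure")


class FakeSmsBist:
    """Stand-in for R_SysTapSmsBist access (hypothetical test double)."""

    def __init__(self, polls_until_done, fail=False):
        self.bits = {"SmartEnb": 0, "SmartRun": 0, "SmartDone": 0, "SmartFail": 0}
        self._remaining = polls_until_done
        self._fail = fail

    def write(self, **fields):
        self.bits.update(fields)

    def read(self):
        if self.bits["SmartRun"] and not self.bits["SmartDone"]:
            if self._remaining == 0:
                self.bits["SmartDone"] = 1
                self.bits["SmartFail"] = int(self._fail)
            else:
                self._remaining -= 1
        return dict(self.bits)
```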


³ “EJTAG Specification”, Revision 3.10, MIPS Technologies document number MD00047. 


SMART activity can be altered by writing to the other control bits in this register prior to setting the SmartRun 
bit. For normal system bring-up the MSP should leave the other writable bits at their reset defaults. This allows 
SMART testing to load the hardware programmed repair mask before running BIST across all SMS groups. De- 
asserting the SMS CLK Enable bit in the Reset Control Register will inhibit BIST operation. 

Class 


R_SysTapSmsBist 


Attributes 


-tapSize=16 


OO 
 | SmartDone | R | 0 | SMART Done (0=not done, 1=done). 


6 | SmartFail | R | | SMART Failure (only valid after SmartDone bit is set; 0=passed, 
1=failed). 


SmartReady R 1 BIST Group Done (signals when the current SMS BIST group has 
finished). 


opens. et BIST Group Failure (signals when the current SMS BIST group has 
failed). 


Pa [Reais [| na pt oF SMART ttn 
| 3 | HardRepair | RW [1 | | Use hardware programmed repair mask (enable before BIST). 
ee Use software Biernned repair mask (leave disabled). 


seep Runs SMART on transition to ’1’, clears SmartDone on transition 


12.6.14 Serial Configuration Bus Interface Register 
Description 


The Serial Configuration Bus (SCB) Interface Register allows the MSP to communicate with devices on the 
SCB. All chip clocks need to be running when the SysChain accesses the SCB. In chip test mode we ensure this 
by putting all but the fabric SERDES clocks in bypass mode. In a system we ensure this by either tying all clocks 
into bypass mode to SCH_TCK or by starting all the PLLs. 


Any register on the SCB may be written from the SysChain. The SCB mechanism is particularly useful in 
testing the fabric link hardware. The attribute table shows the layout of the SCB scan register. There is just one 
SCB scan register on the ICE9 chip. 


Class 

R_SysTapScb 

Attributes 

-tapSize=64 

Bits | Field | Access | Reset | Description 
63:32 | Data | RW | | Read/Write Data. On writes, data to be written. On reads, when Busy is cleared, the read data. 
31 | Reset | RW | | Reset SCB slaves. Applied when _Go set. On the next “Go”, before sending the read or write transaction, first send a RESET. This is a method of last resort - one short of asserting a real reset wire - to allow hung slaves to be accessed. 
 | Addr | RW | 0 | Address. Applied when _Go set. 
 | | | | Write, not read. Applied when _Go set. Assert for writes, clear for reads. 
 | Busy | | | Command busy. SCB sets this return on a “Go” and clears it when a SysChain write completes or a read returns data. Overlaps allowed. 
 | Go | | | Go and start command. Must be a one for the SCB to process this command. The SCB will then clear this bit in the response. 



Class 

R_SysTapScb64 

Attributes 

-tapSize=104 

Bits | Field | Access | Product | Description 
 | Reset | RW | TWC9A+ | Reset SCB slaves. Applied when _Go set. On the next “Go”, before sending the read or write transaction, first send a RESET. This is a method of last resort - one short of asserting a real reset wire - to allow hung slaves to be accessed. 
 | Busy | | TWC9A+ | Command busy. SCB sets this return on a “Go” and clears it when a SysChain write completes or a read returns data. Note the R_SysTapScb register has this bit overlapped with _Go; here it is separate. 
94:66 | Addr | | | Address bits 30:2. Saved when _Go set. 
 | | | | Write, not read. Applied when _Go set. Assert for writes, clear for reads. 
 | | | | Doubleword access. Applied when _Go set. Indicates this transaction is 64 bits instead of 32 bits. Note 32 bit transactions write and return data in naturally aligned position; that is, if _Addr[2] is set, then _Data[63:32] is used. 
63:0 | Data | RW | TWC9A+ | Read/Write Data. On writes, data to be written. On reads, when Busy is cleared, the read data. 


12.6.15 MSP-Hosted Node Attention Register 


R_SysTapAtnMsp provides the MSP manipulated side of the MSP to node chip communication channel. When 
used in conjunction with the node chip manipulated register R_ScbAtnChip, two-way communication can be provided 
via the SysChain between software running on the MSP and software running on the node chip. See 10.14.12 
for a more detailed description of the R_ScbAtnChip register. 

To send a 25-bit character to the chip, the MSP polls until SendVld is clear. The MSP then writes SendData 
and writes a one to SendVld. To receive a 25-bit character, the MSP polls for RecvVld set, reads the data from 
RecvData and then writes a one to RecvTaken. 

Note that a register read or 8 SysChain clocks must occur after any write to this register for the write to take 
effect (see 12.6.1). 
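
The send and receive handshakes above can be sketched from the MSP side as follows. This is a minimal model under stated assumptions: `FakeAtnMsp`, its `chip_step` method (which stands in for the node chip running concurrently and simply echoes data back), and the `max_polls` limit are all hypothetical. The send path follows the prose description, which writes a one to SendVld; the follow-up read after each write models the requirement that a register read must occur before a write takes effect.

```python
class FakeAtnMsp:
    """Hypothetical model of R_SysTapAtnMsp with a chip that echoes data back."""

    def __init__(self):
        self.send_vld = 0
        self.recv_vld = 0
        self.send_data = 0
        self.recv_data = 0

    def read(self):
        return {"SendVld": self.send_vld, "RecvVld": self.recv_vld,
                "RecvData": self.recv_data}

    def write(self, SendData=0, SendVld=0, RecvTaken=0):
        if SendVld:  # SendData is ignored unless SendVld is written with a one
            self.send_data, self.send_vld = SendData, 1
        if RecvTaken:  # writing one to RecvTaken clears RecvVld
            self.recv_vld = 0

    def chip_step(self):
        """Chip side (invented): take pending send data and echo it back."""
        if self.send_vld and not self.recv_vld:
            self.recv_data, self.recv_vld = self.send_data, 1
            self.send_vld = 0


def msp_send(reg, data, max_polls=100):
    """Poll until SendVld is clear, then write SendData with SendVld set."""
    for _ in range(max_polls):
        if not reg.read()["SendVld"]:
            reg.write(SendData=data, SendVld=1)
            reg.read()  # a read after the write lets the write take effect
            return True
        reg.chip_step()
    return False


def msp_recv(reg, max_polls=100):
    """Poll for RecvVld, capture RecvData, then write one to RecvTaken."""
    for _ in range(max_polls):
        state = reg.read()
        if state["RecvVld"]:
            reg.write(RecvTaken=1)
            reg.read()
            return state["RecvData"]
        reg.chip_step()
    return None
```

The bounded poll loops return False or None rather than spinning forever, so a caller can treat an unresponsive chip as an error.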
Class 
R_SysTapAtnMsp 

Attributes 

-tapSize=32 

Bits | Field | Access | Product | Description 
 | SendReq | | ICE9B+ | Send Data Request. Write one to set and indicate new send data for chip. This will cause _SendVld to assert. 
 | TxAtnMask | | ICE9B+ | Transmit Attention Mask. Write one to indicate sys_atn_l pin should be asserted if SendVld is clear, indicating new data may be sent. If clear, sys_atn_l is not asserted for this reason. Note that _SendVld is clear in the idle steady state, so to prevent permanent attention this bit should be cleared when there is no data to be sent. Overlaps Allowed. 
 | NonComAtn | | ICE9B+ | Non-Communication Attention Request. Attention is required for other than AtnMsp register reasons. A duplicate of the R_ScbAtnInt_NonComAtn bit to avoid the MSP having to change instruction registers in the fast path. (Note writing this bit has no effect, so old ICE9A code that writes _RecvAtn will NOP.) 
29 | RecvAtn | RW | ICE9A | Receive Attention Enable. Write one to indicate sys_atn_l pin should be asserted if _RecvVld is also asserted. If clear, sys_atn_l is never asserted for this reason. Overlaps Allowed. 
28 | RecvTaken | W1C | | Receive Data Taken. Write one to send to chip indication that RecvData was accepted, and clear _RecvVld. 
 | RecvVld | | | Receive Data Valid. Valid flag from Chip, one indicates Data contains new receive data. Cleared by writing one to _RecvTaken. 
 | SendVld | RW1CS(*) | | Send Data Valid. ICE9A: RW1S; write one to set and indicate new send data for chip. ICE9B+: Read only, write using _SendReq instead. BOTH: Read to indicate send data pending for chip. Cleared when chip takes the data. 
 | RecvData | | | Receive Data. Overlaps SendData. If RecvVld is set, returns the next data to be received from the MSP. Note this is different data than that written. 
25:0 | SendData | W | | Send Data. Overlaps RecvData. If SendVld is simultaneously being written with a one, enqueues new send data for the chip, and sets SendVld. If SendVld is not being written with a one, this is ignored. This enables the MSP on a read of this register to set only bit 29 inbound, and not recirculate other bits. 


12.6.16 External JTAG Chains 


Figure 12.8 shows how the SysChain and JTAG TAPs are connected on the CPU module. All nine ICE9 TAPs 
are connected in series, and share common TRST, TCK, and TMS lines. TDI, TRST, and TCK are distributed 
module-wide; all ICE9s see the same values of these signals at all times. TMS is separately distributed to each 
ICE9 to facilitate manipulating a subset of the ICE9s on a module without having to place the others in reset or 
bypass mode. TDO is individually multiplexed from each ICE9 to allow the MSP to receive a single ICE9’s serial 
data even if multiple ICE9s are being scanned. 
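
To illustrate why a separately distributed TMS lets the MSP manipulate a subset of chips, the sketch below models the standard IEEE 1149.1 TAP controller state machine. It assumes the ICE9 TAPs follow that standard state diagram; the names `TAP_NEXT`, `clock_tap` and `RESET_SEQUENCE` are invented for this example. It shows two properties: five TCK pulses with TMS asserted reach Test-Logic-Reset from any state (and one more pulse with TMS de-asserted moves to Run-Test/Idle), and a TAP whose TMS is held low stays parked in Run-Test/Idle no matter how long the shared TCK runs.

```python
# IEEE 1149.1 TAP controller: state -> (next state if TMS=0, next state if TMS=1)
TAP_NEXT = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE": ("RUN_TEST_IDLE", "SELECT_DR"),
    "SELECT_DR": ("CAPTURE_DR", "SELECT_IR"),
    "CAPTURE_DR": ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR": ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR": ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR": ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR": ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR": ("RUN_TEST_IDLE", "SELECT_DR"),
    "SELECT_IR": ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR": ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR": ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR": ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR": ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR": ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR": ("RUN_TEST_IDLE", "SELECT_DR"),
}

# Five TCK pulses with TMS asserted, then one with TMS de-asserted.
RESET_SEQUENCE = [1, 1, 1, 1, 1, 0]


def clock_tap(state, tms_bits):
    """Advance one TAP controller by one TCK pulse per TMS value."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


if __name__ == "__main__":
    for start in TAP_NEXT:
        print(start, "->", clock_tap(start, RESET_SEQUENCE))
```

Because TCK is common to the whole module, every TAP sees every pulse; it is the per-chip TMS value that decides whether a given controller moves or stays put.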


Figure 12.8: ICE9 E-Silicon and SysChain JTAG TAP Connections 


12.7 Global reset 


The ICE9 chip implements a 2-level reset strategy. Hard-reset (normally asserted at power-on) is a chip pin. 
To provide for reset of parts of the chip under module-service-processor control, there are soft-reset bits from the 
SysChain’s Reset Control Register (see Section 12.6.10), which OR into the reset distribution for the relevant parts 
of the chip. Distribution of hard-reset assertion is asynchronous; distribution of the de-assertion is synchronous 
within a PLL clock domain. Hard-reset is distributed to all resettable logic on the chip. Assertion of the soft-resets 
is synchronous to the SysChain_TCLK scan clock, which is asynchronous with respect to the clocks for the logic 
being reset. De-assertion of the soft-resets is synchronous after passing through the dual-rank synchronizer for the 
appropriate clock domain, like the de-assertion of hard-reset. 

Figure 12.9 illustrates the scheme. The RCREG_RESET_CCLK[*] and PLLCREG_RESET_PLLC pins are signals 
from the SysChain reset vector (section 12.6.10), hence they are in the SysChain_TCLK scan clock domain. The 
2 flops form a dual-rank synchronizer to bring the signals into, in this case, the cclk domain. The gates downstream 
provide an asynchronous path around the flops for the asserting edge, so that only the deasserting edge is 
synchronous in the cclk domain. For the PLL resets (RESET_PLL_C in the figure), both edges must be asynchronous, 
since the clock will not be running to clock the flops until the deassertion of reset propagates to the PLL. 
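
The asynchronous-assert, synchronous-deassert behavior can be illustrated with a small behavioral model. This is a sketch only: the class and method names are invented, signals are modeled as active-high integers regardless of the real assertion levels, and the OR gate is an assumed reading of "the gates downstream" rather than the actual netlist.

```python
class ResetSynchronizer:
    """Behavioral model: dual-rank synchronizer plus an asynchronous bypass."""

    def __init__(self):
        self.ff1 = 0  # first rank, samples the request from the scan clock domain
        self.ff2 = 0  # second rank, drives the synchronous part of the reset

    def clock_edge(self, request):
        """One rising edge of the destination clock (for example cclk)."""
        self.ff2 = self.ff1
        self.ff1 = request

    def reset_out(self, request):
        """Assertion bypasses the flops; de-assertion waits for them to flush."""
        return request | self.ff2


if __name__ == "__main__":
    sync = ResetSynchronizer()
    print("asserted before any clock:", sync.reset_out(1))
    sync.clock_edge(1)
    sync.clock_edge(1)
    for edge in range(3):
        print("edges after release:", edge, "reset:", sync.reset_out(0))
        sync.clock_edge(0)
```

In this model the reset output rises as soon as the request does, with no clock required, but falls only after two destination-clock edges have flushed the synchronizer.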


Figure 12.9: Reset Distribution for the CClk domain (signals shown: SCAN_MODE, RCREG_RESET_CCLK[n-1:0], 
PLLCREG_RESET_PLLC, CCLK, SYS_RST, RESET_C[n-1:0], RESET_PLL_C; similar circuitry for domains D, S, I and P) 

For logical clarity the figure is drawn without any indication of signal assertion level. RESET_C[*] is the normal 
reset and would be used for most logic. 


There will be a number of reset signals, one for each part of the chip that needs to be reset separately under 
control of the module service processor. The distribution of resets and clocks is shown in Figure 12.10. 


12.8 Boot Timeline 


This section describes the order of system bring-up from outlet power.* Specifics on power sequencing, etc., may 
be found in the system specification. 


12.8.1 SSP Boot Timeline 
1. On power being applied to the cabinet, the first thing to power up and boot is the System Service Processor. 


2. Whether automatically or on command from an administrator, the SSP enables power to the CPU modules. 


12.8.2 MSP Boot Timeline 


1. Once power is applied, the hard reset pin, sys_rst_l, and sch_trst_l are asserted to every ICE9. This is done 
with hardware even before the MSPs (module service processors) boot. The sys_clk is ensured to be running 
so that sys_rst_l propagates throughout every ICE9 as described below. 


2. Each MSP boots from its internal flash. The MSPs request a kernel and application from the SSP, which 
serves them via TFTP or similar mechanism. 


3. Each MSP turns on the DC-DC converters that power the ICE9 chips on its module. 


4. Each MSP begins an orderly bring-up of all the ICE9 chips on its module, in parallel. 


Figure 12.10: Reset & Clock distribution block diagram with real net names 


* Other documents reference the step numbers in the sections that follow. It is highly recommended that the ordering of existing 
steps remain unchanged. Adding steps to the end of a list is safe, but if additional steps must be inserted into the middle of a list, add 
them at an indented level as a, b, c, etc. If a step must be removed from a list, keep the step, but replace its text with an italicized 
comment, such as: This operation removed; continue to the next step. 


12.8.3 Pre-DRAM Boot Timeline 


Figure 12.11: Reset Timing (waveforms for SYS_RST, SYS_CLK, internal clock, internal reset, and SCH_TCK, with 
SCH_TCK activity blocks 1 through 4 marked) 


1. The power-on assertion of sys_rst_l at the ICE9 has two effects on the ICE9. First, the PLL reference clocks, 
sys_clk_e_h/l and sys_clk_o_h/l, bypass their respective PLLs, so that all domains are clocked by sys_clk. Since 
most ICE9 resets are pipelined and are therefore effectively synchronous, this ensures that reset propagates 
throughout the chip. Second, sys_rst combinationally bypasses the sys_chain reset register, so that reset is 
applied without the need for any sys_chain scan activity. In Figure 12.11, “internal reset” represents a reset 
signal in any clock domain. 


2. At some later time (SCH_TCK activity block 1 in Figure 12.11), the MSP resets the sys_chain TAP controller 
and the six EJTAG TAP controllers, which share a common 4-wire TAP, by issuing two sequences of five 
TCK pulses with TMS asserted, followed by one TCK pulse with TMS de-asserted. This sequence guarantees 
that all TAP controllers in the sys_chain are reset and left in the RUNTEST/IDLE state and, by running the 
sequence twice, that the Virage Fuse-ROM values have been properly loaded. Every TCK pulse with sys_rst_l 
asserted initializes the PLL, Reset, CPU Debug Interrupt, and SMS RAM BIST control registers, and the 
SCB interface and Attention MSP registers, to their reset values. 


3. Subsequently, the MSP stops asserting sys_rst_l. However, all internal resets remain asserted because of the 
initialization of the Reset control register. The PLLs are no longer bypassed with sys_clk; each will run at 
a frequency determined by the values with which the PLL Control register was initialized. This allows RAM 
BIST activity via ATE to proceed at an appropriate speed. 


4. The MSP must poll the lock status of each PLL until either its lock bit is set or the MSP times out waiting 
for lock. This is shown as SCH_TCK activity block 2 in Figure 12.11. Prior to polling for lock, the MSP may 
at this time make changes to the PLL control values. The correct method for changing any PLL control value 
is to first assert reset to the PLL to be changed, then change the control value, and then de-assert reset to 
the PLL and poll its lock bit for assertion. Setting the PLL control registers can be skipped if the 
values established in step 3 are the desired values; however, in all cases the MSP must poll each PLL for lock 
before proceeding to the next step. 


5. Prior to Virage BIST, the MSP deasserts all SMS Resets in the Reset Control Register, leaving all normal 
resets asserted. The MSP then enables Virage BIST, waits for the results, and reads them back (see section 
12.6.13 for details). After successful BIST completion, the Virage RAMs will have been cleared and the MSP 
de-asserts the SMS CLK Enable bit in the Reset Control Register to prevent further BIST operation. This 
is shown as SCH_TCK activity block 3 in Figure 12.11. 


(a) After Virage BIST, the MSP must bring the ICLK PLL out of reset. It is the only PLL that comes 
up held in reset state by assertion of sys_rst_l. To bring the ICLK PLL out of reset, the MSP must 
first ensure that the PCI Reference Clock PLL is in lock (done in step 4 above). The MSP then must 
write 3’b1 to bits <57:55> (the Pline field) of the sys_chain register R_SysTapPll to ensure that the PCI 
Reference Clock is driven onto the ICE9 pins. The default value of this field after reset is 3’b0, which 
would leave the PCI Reference Clock pins in HighZ mode. 

After the MSP has set the R_SysTapPll register to deliver the PCI Reference Clock to the chip pins, it 
then de-asserts the IclkReset bit in the PLL control register and polls the IclkLock bit for assertion. At 
this point the ICLK PLL is operating normally and is no longer in reset. 


6. The MSP sends an EJTAGBOOT instruction to each of the 6 processor EJTAG controllers. When reset is 
released in a later step, this will override the default fetch from 1FC00000 at reset, and instead immediately 
cause the CPU to take a debug exception and wait for instructions over EJTAG. 


7. The MSP then deasserts the internal reset signals for all functional blocks (see 12.6.10). This is shown as 
SCH_TCK activity block 4 in Figure 12.11. 

(a) Note: UartIoEnb is left de-asserted. This will be bundled into whatever code the MSP uses to mux 
Serial I/O to the ICE9s. Whenever a connection is opened to a particular ICE9, that chip’s UartIoEnb 
will be asserted at that time. It will be de-asserted when the MSP closes the connection. 


8. The MSP uses the SysChain/SCB interface to load the module number into R_ScbChipNum (see 10.14.7). 


9. The MSP’s scan-in of EJTAGBOOT in step 6 and release of reset in step 7 cause the CPUs to wait for EJTAG 
instructions. The MSP sends the initial boot routine (boot0.s) to CPU 0 only and then force jumps it via 
EJTAG to the start of the boot0 image. 


10. CPU 0 begins running the boot0 image. The boot0 routine initializes the register file, TLB and caches and 
copies the boot1 routine (boot1.s) from the MSP into the L2 cache. Boot0 then jumps to the boot1 image in 
the L2 cache. 


11. Boot1 begins executing from the L2 cache. At this point the only memory-system difference from normal 
operation is that the DDI initializes in a mode which returns bogus data on reads; otherwise the normal 
L1/L2 write-allocate would hang on the first miss. 


12. CPU 0 starts a memory copy loop, which reads from the EJTAG debug region and writes the L2 cache. 


13. The MSP sends the second boot image to CPU 0 (boot2), using the FASTDATA EJTAG command. This 
requires ~71 shifts per 64 bits of data, or ~2.5 seconds for 256KB at 1 MHz sch_tck. The entire image is 
limited to the L2 cache size, or 256 KB; if this is exceeded the DRAM would need to be initialized before this 
loop to prevent the L2 from creating victims. 


14. The MSP boot image also includes configuration data for the boot process, including PCI-connected and 
DRAM frequency information. 


15. When the copy loop completes, CPU 0 executes the code. This image starts the next phase of the boot 
process. 
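
The download-time estimate quoted for the FASTDATA transfer in the timeline above can be checked with a few lines of arithmetic. The function name and parameters are invented for this sketch; the inputs (about 71 shifts per 64 bits, a 256 KB image, a 1 MHz sch_tck) are the figures given in the text, and the calculation ignores any per-transfer software overhead on the MSP.

```python
def fastdata_download_seconds(image_bytes, tck_hz, shifts_per_doubleword=71):
    """Estimate FASTDATA transfer time: one TCK cycle per shifted bit."""
    doublewords = -(-image_bytes // 8)  # round up to whole 64-bit transfers
    return doublewords * shifts_per_doubleword / tck_hz


if __name__ == "__main__":
    seconds = fastdata_download_seconds(256 * 1024, 1_000_000)
    print(f"256 KB at 1 MHz: {seconds:.2f} s")
```

For a full 256 KB image this gives 32768 doublewords and roughly 2.3 seconds of pure shifting, consistent with the approximate 2.5 second figure once overhead is allowed for.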


12.8.4 DRAM Boot Timeline 


1. The newly installed cache code initializes DRAM. This includes reading the DIMM I2C configuration, 
programming the controller, and testing/zeroing memory. (Of course, the code needs to be careful not to overwrite 
or evict itself until it completes the memory copy. One alternative is to have stage one boot load at 31GB - 
above where there will be memory.) 


2. The code performs the BIOS-ish initialization required prior to kernel boot. 


3. After DRAM is initialized, the boot0/boot1 step described for CPU 0 is repeated on CPUs 1-5. The download 
copy loop steps for boot1 are skipped, as the boot1 image is already in the L2 cache. CPUs 1-5 jump directly 
to boot1 at the end of boot0. 


4. Now running from the caches, CPUs 1-5 enable interrupts and execute a WAIT instruction, which will put 
them to sleep until they receive an interrupt from CPU 0 during kernel boot. 


5. The kernel loader is copied into DRAM by CPU 0 from the MSP via the EJTAG FASTDATA command. 
At the completion of the copy, the MSP force jumps CPU 0 into the kernel loader. Unlike the previous 
memory copy loops, the kernel loader performs decompression and checksumming of the kernel. 


6. The MSP receives the compressed kernel image from the SSP and uses the EJTAG FASTDATA command to 
transfer it to the kernel loader running on CPU 0. 


7. Upon successful completion of the kernel download, the kernel is executed. 


12.8.5 Kernel Boot Timeline 


1. The kernel performs its normal boot sequence. 
2. When kernel boot is complete, CPU 0 sends an interrupt to CPUs 1-5, which releases them from WAIT. 


3. The kernel asks the MSP to enable the watchdog timer. (Or, more correctly, switch from a very long wait- 
for-boot timeout to a shorter heartbeat timeout.) 


4. The fabric and DMA drivers initialize the fabric switch, links, and DMA engine as described below. 


5. Login :) 


12.8.6 Booting the Fabric Switch and Links 


1. At power-on, the fabric links, fabric switch, and DMA are held in reset by bits in the R_SysTapReset register. 
Deassert reset to FSW (clear FabSw bit in R_SysTapReset). 


2. Configure the FSW registers through writes on the SCB. See the FSW chapter for details on each register. In 
particular, in R_FswBlockReset, deassert reset on all blocks. In R_FswBlockEnable, enable all blocks. The 
FSW is now ready to transfer packets to/from the links and DMA, but nothing will happen yet since the 
links and DMA are still in reset. 


3. Bring fabric links out of reset (clear FabLn bit in R_SysTapReset). 


4. Configure the FL registers through writes on the SCB. Bring up each link into MissionMode. See the Fabric 
Link chapter for details. 


At this point, the ICE9 can accept packets from its three upstream neighbors and send them to its three downstream 
neighbors. The MSP or a processor can use out-of-band communication channels, watch packet statistics, set and 
clear interrupts, etc. This ICE9’s DMA engine cannot send any packets because it is still in reset. Any packets 
coming from upstream that are destined for the DMA flow through the fabric switch to the DMA RX port, which 
because it is in reset, will accept the packets and drop them. 

On nodes with BIST, DRAM and other failures preventing Linux boot, the MSP will be able to initialize the 
fabric by this process using the SysChain/SCB alone (without any CPU core). 


12.8.7 Booting the DMA Engine 


It is assumed that the fabric switch and links are already initialized as described in the previous section. 

For the processors to communicate with the fabric (other than reading and writing CSRs), they must boot the 
DMA engine. The DMA engine must be configured by processors, because many of the configuration registers are 
accessible only through the CSW. 


1. Bring the DMA engine out of reset (clear Dma bit in R_SysTapReset). 


2. Configure the DMA engine 


(a) Write zero to every location in R_DmaDmem and R_DmaImem. 
(b) Write DMA microcode to R_DmaImem and initial data to R_DmaDmem from the DMA loader file. 
(c) Initialize the DMA microcode application data as described in the Initialization section of the DMA 
chapter. For example, the application needs the physical address of various queues and data structures. 


3. Start the DMA Engine by setting all ThreadEnable bits in R_DmaThreadSel. 


4. Deassert reset to the DMA’s TX and RX ports in R_DmaBlockReset. This allows packets to begin to flow 
between the DMA and fabric switch. 


12.8.8 Rebooting with Fabric Still Up 


The ICE9 allows the fabric switch and links to be operated even while the rest of the node is being reset. As 
long as the FabSw and FabLn bits of R_SysTapReset are deasserted, the fabric switch and link will continue to 
route fabric traffic. This allows the ICE9 to be rebooted without backing up the fabric. When software has decided 
to reset the chip without affecting the fabric, the sequence of events is as follows: 


1. Disable the crosspoint buffers in the fabric switch leading from DMA to the fabric transmitters by clearing 
R_FswBlockEnable bits for XB30, XB31, and XB32. This prevents any new DMA traffic from flowing into 
the fabric. (Using R_FswBlockEnable instead of R_FswBlockReset stops traffic on clean packet boundaries.) 


2. First shut down the DMA (cleanly if possible) and then assert reset to the DMA. Once the DMA is in reset, 
its RX port accepts incoming packets and throws them away, and its TX port will not send anything else. 


3. Reset anything else in the chip that is needed. At the point in the boot process that the fabric switch would 
be initialized, you need to detect whether the fabric switch and link are already running. For example, the 
detection could be based on whether FSW and FL are already out of reset, or if fabric links are in mission 
mode, or it could be based on nonzero packet counter statistics. If the FSW and FL are not running, you 
would initialize them as described in section 12.8.6. If they are already running, continue with this sequence. 


4. Enable all blocks in R_FswBlockEnable. 


5. Proceed with Booting the DMA Engine, described in section 12.8.7. 
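
The ordering of the steps above, including the detection in step 3, can be sketched as follows. Everything here is a hypothetical model: `FakeChip`, its dictionaries, and the log strings are invented for illustration, reset bits are modeled with 1 meaning reset asserted, and the detection test used (both fabric reset bits already deasserted) is only one of the options the text suggests.

```python
class FakeChip:
    """Hypothetical register model; field names follow the text, storage is invented."""

    def __init__(self, fabric_running):
        held = 0 if fabric_running else 1
        self.reset = {"FabSw": held, "FabLn": held, "Dma": 0}  # 1 = in reset
        self.fsw_block_enable = {"XB30": 1, "XB31": 1, "XB32": 1}
        self.log = []


def reboot_keeping_fabric(chip):
    # Step 1: stop new DMA traffic into the fabric on clean packet boundaries.
    for xb in ("XB30", "XB31", "XB32"):
        chip.fsw_block_enable[xb] = 0
    chip.log.append("dma_crosspoints_disabled")

    # Step 2: shut down the DMA, then hold it in reset.
    chip.reset["Dma"] = 1
    chip.log.append("dma_in_reset")

    # Step 3: reset whatever else is needed, then detect the fabric state.
    chip.log.append("node_reset")
    fabric_running = chip.reset["FabSw"] == 0 and chip.reset["FabLn"] == 0

    if not fabric_running:
        # Fall back to the normal bring-up of section 12.8.6, which also
        # enables all FSW blocks.
        chip.reset["FabSw"] = 0
        chip.reset["FabLn"] = 0
        chip.log.append("fabric_initialized")
    for xb in chip.fsw_block_enable:
        chip.fsw_block_enable[xb] = 1  # step 4
    chip.log.append("fsw_blocks_enabled")

    # Step 5: boot the DMA engine as in section 12.8.7 (abbreviated here).
    chip.reset["Dma"] = 0
    chip.log.append("dma_booted")
    return chip.log
```

The point of the ordering is that the crosspoint buffers are disabled before the DMA is reset and re-enabled only after the rest of the node has been reset, so fabric traffic passing through the node is never interrupted.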
Chapter 13 


PCI Express Subsystem 


[$Id: chippci.lyx 50693 2008-02-07 16:01:46Z wsnyder $] 


13.1 Overview 


The ICE9 chip includes a PCI-Express root complex subsystem. The PCI-Express subsystem provides the ICE9 
cores with access to PCI-Express peripheral chips either on the processing module or on external cards. While 
the subsystem typically talks to just a single PCI-Express device, there is no hardware limitation that prevents 
implementation of more complex topologies. 

The specifications for the PCI-Express subsystem are: 


Implements a root complex. 

Supports packet sizes of 128B, 256B, and 512B. 
Supports end-to-end CRC checking. 

Supports one virtual circuit. 

Supports PCI-Express power off mode L3. 


Translates CPU physical addresses to/from PCI addresses. See Chapter 16. 


13.2 Differences, Bugs, and Enhancements 


13.2.1 Product and Chip Pass Differences 


1. ICE9B fixes legacy interrupt D behavior incorrect during a link down, bug1984. In ICE9A, if an 
ASSERT_INTD message arrives from the endpoint, software will service the interrupt. During this time, if 
the link goes down, an implicit DEASSERT_INTD should occur, but this did not happen. So if the interrupt 
service routine ends with a “wait for DEASSERT_INTD”, it is possible that it will hang forever. 


2. ICE9B fixes ecc error ignored when CLEAR comes at the same time, bug2028. In ICE9A, suppose an ECC 
error is in effect and the interrupt has been raised. If software later clears the interrupt and a second ECC 
error arrives at the same time (in the cycle where the PMI checks for both the ECC error and the clear), 
the PMI ignores the second ECC error. 


3. ICE9B fixes the MsiBaseAddr register addressing, bug2097. In ICE9A, software has to program the PMI 
MsiBaseAddr register with an ICE9 address converted into a PCIe space address (look at the address mapping 
in the hardware spec). 


4. ICE9B fixes RX detection not being completed when some lanes are disabled, bug2113. In ICE9A, when one 
or more lanes of a multi-lane link are disabled using TxCompliance/TxElecIdle as described in Section 8 of 
the PIPE specification, initiating a receiver detection sequence will cause the PCS layer to hang due to the 
“turned off” lanes not performing the receiver detection operation. To work around this, enable all lanes prior to 
performing a receiver detection operation, as lanes which are turned off will not participate in the receiver 
detection sequence. 


5. NEED IMPL: TWC9A fixes only the bottom 16 bits being writable in R_PmiVmReqDat, bug2760. We 
couldn’t find any PCIe vendor which uses vendor messages, so this is of only minor concern. 


13.2.2 Known Bugs and Possible Enhancements 


1. None. 


13.3 Internal Structure 


The PCI-Express subsystem consists of six layers: 

1. The PHY layer, which implements the 2.5 GHz SerDes used for PCI-Express I/O. 

2. The PCS layer, which converts parallel binary data received from the MAC layer to 8B/10B encoded 
serial data for the PHY. 

3. The MAC layer, which implements the physical connection path for PCI-Express. 

4. The link layer, which implements the logical connection path for PCI-Express. 

5. The transaction layer, which implements PCI-Express transactions and queues. 

6. An application layer, which interfaces between the L2 cache and the transaction layer. 


Layer 6, the application layer, is designed by SiCortex, and is synthesized RTL. Layers 3-5, the transaction, link, and 
MAC layers, are part of the PCI-Express controller core. This core is purchased from Synopsys and is synthesized 
RTL. Layers 1-2, the PCS and PHY layers, are part of the PCI-Express PHY core. This core is purchased from 
Synopsys. The PCS layer is synthesized RTL. The PHY layer is a hard macro. 


13.4 Known Bugs and Enhancements 


1. R_SysTapReset_Scb was originally intended to reset only the SCB and the OCLA LAC. However, in ICE9A, 
ICE9A1 and ICE9B this also ends up resetting the cclk parts of the PMI. This was not intended. In future 
revisions of the chip an additional bit may be added to the R_SysTapReset register to allow for resetting the 
PMI without resetting the SCB and OCLA LAC. bug2929. 


13.5 Process Requirements 


The PCI-Express PHY core requires 2.5V or 3.3V thick oxide and input voltage for its analog circuits. For its 
90nm general purpose (G) process, TSMC offers either a dual oxide option (1.0V/2.5V) or a triple oxide option 
(1.0V/1.8V/3.3V). Since the DDR PHY is a dual-process DDR/DDR2, 2.5V/1.8V design, it does not require 
1.8V oxide, and ICE9 will use the dual oxide option; thus the PCI-Express PHY will run off 2.5V, as will all 
general-purpose IO buffers and PLLs. 


13.6 Application Layer and the PMI 


PMI is the unit name for the PCI controller and all application layer components. This unit also includes the 
interface between the L2 cache switch (CSW) and the miscellaneous I/O units including the UART, I2C and the 
SCB. Figure 13.1 shows the top level block diagram of the application layer and its connection to the root complex. 

Figure 13.1: PMI Block Diagram - The Application layer between the CSW and PCI Root Complex 

The PMI consists of 5 major pieces. The CSI is a control/status register interface that allows processors 
to perform I/O register reads and writes to the UART (see Section 15), the I2C controller (Section 14), the Serial 
Control Bus (Section 10), the RC’s configuration port (DBI), the RC’s vendor message interface (VMI), the RC’s 
system information interface (SII), the PCI PHY configuration port and internal control and status registers. The 
REQ handles all requests from processors and the responses generated by the PCI network. The CMP handles 
inbound requests from downstream devices and generates completion events in response to the requests. The CMX 
is the command multiplexer and the DMX the data multiplexer. Each of these components is described below. 

Figure 13.2: REQ Unit 

It is important to note that the PMI contains logic that runs in TWO different clock domains. The RC is driven 
by a fixed frequency ICLK at 250MHz. The PMI interface to the CSW runs at the CCLK frequency, which may 
range from 200 to 300 MHz, as it is tied to the processor clock rate. The synchronizer boundaries between the two 
domains are contained entirely in the CSI, REQ, and CMP units. 


13.6.1 The Requestor Unit REQ 


The requester unit transforms CSW I/O accesses (RDIO and WTIO) into PCI Express Transaction Layer 
Packets (TLPs). In the case of read transactions, the REQ also handles the returning completion TLP from the 
RC and turns it into a 64 bit data transfer over the CSW back to the original requesting processor. RDIO and 
WTIO requests are limited to no more than 64 bits. As such, only CSW Data0 and the associated byte mask are 
relevant. The REQ generates six kinds of TLPs: Memory Read, Memory Write, IO Read, IO Write, Config Read 
and Config Write. In the case of all non-posted requests, the PCI TID assigned to the transaction is equal to the 
TID received on the CSW command/address TID inputs. This allows simple matching of completion events to the 
initiating request. 

A block diagram of the REQ is shown in Figure 13.2. 


13.6.1.1 REQ Memory Read Request Handling 


A memory read request is initiated by a CSW RDIO to an address in the PCI Memory Address Range. The 
RDIO CSW operation arrives on the inbound command bus. It is then converted into a transaction on the XALIO 
interface that will create a memory read request transaction. At some later time, the RC will respond with a 
completion packet on the RCPL port. This will be converted by the REQ into a transaction on the CSW Data 
lines. All memory read requests from the CSW are 64 bit aligned. Addresses on the PCI, however, can be 32 bit 
aligned. As such, if the active bits in the byte mask indicate that the TLP can be contained within a 32 bit aligned 
chunk of data, the address will be modified to be 32 bit aligned and only 4 bytes will be retrieved across the PCI. 
Returned data will either occupy all 64 bits of the Data0 lines on the CSW, in the case of a 64 bit access, or the 
32 bits retrieved will be duplicated on the upper and lower 32 bits of Data0. Requests to addresses within the first 
4GB of the PCI Memory Address range will cause 3 DW header (32 bit address) transactions, while those to the 
remainder of the range will cause 4 DW headers (64 bit address) transactions. 
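
The narrowing of a 64 bit aligned CSW request to a 32 bit aligned PCI access can be sketched as follows. This is an illustrative Python model, not the RTL: the function name is hypothetical, and it assumes that bit i of the byte mask enables the byte at address + i, which the text does not state.

```python
def narrow_request(addr: int, byte_mask: int) -> tuple[int, int]:
    """Return (pci_address, byte_count) for a CSW request of up to 64 bits.

    Assumption (not from the spec): byte_mask bit i enables the byte at addr + i.
    """
    if addr % 8 != 0:
        raise ValueError("CSW requests are 64 bit aligned")
    if not 0 < byte_mask <= 0xFF:
        raise ValueError("byte mask must enable at least one of the 8 bytes")
    low, high = byte_mask & 0x0F, byte_mask >> 4
    if high == 0:
        return addr, 4        # only the lower 32 bit chunk is touched
    if low == 0:
        return addr + 4, 4    # only the upper 32 bit chunk is touched
    return addr, 8            # bytes in both chunks: full 64 bit access
```

A mask that touches both halves, even with only two bytes enabled, still produces a full 64 bit access in this model.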


13.6.1.2 REQ Memory Write Request Handling 


A memory write request is initiated by a CSW WTIO to an address in the PCI Memory Address Range. The 
WTIO CSW operation arrives on the inbound command bus. The REQ responds by initiating a RDIO CSW 
operation to retrieve the write data from the original requesting processor. Once the data has arrived, the REQ 
builds a memory write TLP, with the appropriate byte mask, by driving the corresponding signals on the XALI0 
interface. Like read requests, write requests are aligned to 64 bits. However, if 
the data to be written is contained within one 32 bit aligned chunk, as indicated by the byte mask, the address 
will be modified to be 32 bit aligned and only 4 bytes of data will be sent. Requests to addresses within the first 
4GB of the PCI Memory Address range will cause 3 DW header (32 bit address) transactions, while those to the 
remainder of the range will cause 4 DW headers (64 bit address) transactions. 


13.6.1.3 REQ IO Read Request Handling 


An IO read request is initiated by a CSW RDIO to an address in the PCI I/O Address Range. Other than the 
transaction type field driven to the RC, the REQ processes an IO Read Request in the same manner as a memory 
read request. IO requests are, however, limited to no more than 32 bits of data and address (this means the byte 
mask for Data0 from the CSW can only have bits set in the upper or lower nibble). The address is appropriately 
modified as per the bits in the byte mask. 


13.6.1.4 REQ IO Write Request Handling 


An IO write request is initiated by a CSW WTIO to an address in the PCI I/O Address Range. Other than the 
transaction type field driven to the RC, the REQ processes an IO Write Request in the same manner as a memory 
write request. IO requests are, however, limited to no more than 32 bits of data and address (this means the byte 
mask for Data0 from the CSW can only have bits set in the upper or lower nibble). The address is appropriately 
modified as per the bits in the byte mask. 


13.6.1.5 REQ Configuration Read Request Handling 


A config read request is initiated by a CSW RDIO to an address in the PCI Configuration Address Range. The 
REQ processes a Configuration Read Request in a similar manner as a memory read request. The transaction type 
is different and the address is modified to shift bits [27:12] up to bits [31:16]. The address is also appropriately 
modified to account for config transactions being 32 bit aligned. If bits [27:20] in the CSW address match the 
primary bus number of the RC, an error will be returned to the originator. This signifies an attempt to access the 
RC config registers. Accesses of the config registers within the RC can only be made via the DBI. If bits [27:20] in 
the CSW address match the secondary bus number, a CONFIG0 type transaction will be sent. Any other values 
will be sent as a CONFIG1 type transaction. 
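
The address transformation and transaction type selection can be sketched as below. This is a hedged Python illustration: the function name and the returned labels are hypothetical, and the handling of the low address bits (kept as a 32 bit aligned register offset) is an assumption, since the text only says the address is "appropriately modified".

```python
def config_request(csw_addr: int, primary_bus: int, secondary_bus: int):
    """Classify a config access and build the PCI address field.

    Returns (kind, pci_addr); pci_addr is None when an error is returned.
    """
    bus = (csw_addr >> 20) & 0xFF          # CSW address bits [27:20]
    if bus == primary_bus:
        return ("ERROR", None)             # RC registers are reachable only via the DBI
    upper = (csw_addr >> 12) & 0xFFFF      # CSW address bits [27:12]
    reg = csw_addr & 0xFFC                 # assumption: 32 bit aligned register offset
    pci_addr = (upper << 16) | reg         # bits [27:12] moved up to [31:16]
    kind = "CONFIG0" if bus == secondary_bus else "CONFIG1"
    return (kind, pci_addr)
```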


13.6.1.6 REQ Configuration Write Request Handling 


A config write request is initiated by a CSW WTIO to an address in the PCI Configuration Address Range. The 
REQ processes a Configuration Write Request in a similar manner as a memory write request. The transaction 
type is different and the address is modified to shift bits [27:12] up to bits [31:16]. The address is also appropriately 
modified to account for config transactions being 32 bit aligned. If bits [27:20] in the CSW address match the 
primary bus number of the RC, an error will be returned to the originator. This signifies an attempt to access the 
RC config registers. Accesses of the config registers within the RC can only be made via the DBI. If bits [27:20] in 
the CSW address match the secondary bus number, a CONFIG0 type transaction will be sent. Any other values 
will be sent as a CONFIG1 type transaction. 


13.6.1.7 REQ Sub-blocks 


The REQ spans both the CCLK and the ICLK domains. The CXD and CCM both operate in the CCLK 
domain. The PXD and PCM operate in the ICLK domain. The RRF handles all clock domain crossings. 

Commands from the CSW are pushed into a FIFO within the CXD. The FIFO is six entries deep (one for each 
of the six processors — we don’t allow the DMA engine to send transactions to the RC). Commands are taken off 
the FIFO one at a time and fully processed before the next command is attended to. The CXD is responsible for 
decoding the incoming address (to determine which address region — PCI Memory, PCI I/O, or PCI Configuration 
— the address maps into) and sending the command/data to the RRF. If the operation is a write operation, the 
CXD must issue a RDIO command to first fetch the data payload and will write the command and data once the 
RDIO data is received. Read commands are sent to the RRF directly after address decoding. 

The PCI Express side of the transmit path (the PXD) reads the command/data from the RRF. The PXD 
converts the Address, Command, Byte mask, and TID from the CSW into the appropriate outbound packet via 
the XALI0 bus to the RC. 

Completion packets arrive on the RCPL port from the root complex and go to the PCM. The PCM rips the 
reply packet apart and writes the returned 64/32 bit word and transaction ID into the RRF. A completion can not 
be serviced until all write transactions that preceded it coming from the RC have been completed. The CCM takes 
the data from the RRF and passes it to the DMX, in the case of read operations. It also sends a release to the 
CXD for all completions, allowing it to move onto the next command. 


13.6.1.8 REQ Exception Handling 


Error conditions can arise in a number of places in the REQ: 


Errored Completion from Root Complex: If the RC signals an error in a completion, the error details will 
be logged in the PmiReqCompErr register (section 13.13.15) and a bit set in the PmiIntr register (section 13.13.2). 
Sources of this error include bad ecrc, poisoned, unsupported request, completer abort, config retry, tlp abort, dllp 
abort and completion timeout. The PmiReqCompErr register includes information containing the reason for the 
failed completion. 

If the transaction was a read, all ones data will be returned to the originating processor. The exception to this 
is a Config Read with an “unsupported request” completion; this is a normal part of the enumeration process and 
so all ones data will be returned, but no error logged. 
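
The read-side policy for errored completions can be summarised in a few lines. This is a sketch only: the status and kind strings are hypothetical labels, and the 64 bit width of the all-ones pattern is an assumption based on the 64 bit Data0 transfer described for the REQ.

```python
ALL_ONES = (1 << 64) - 1  # assumed width: one 64 bit CSW Data0 transfer

def finish_read(kind: str, status: str, data: int) -> tuple[int, bool]:
    """Return (data_to_processor, log_error) for a completed read.

    kind:   "MEM_READ", "IO_READ" or "CONFIG_READ" (hypothetical labels)
    status: "OK" or an error reason such as "UNSUPPORTED" (hypothetical labels)
    """
    if status == "OK":
        return data, False
    # An unsupported-request completion to a config read is normal during
    # enumeration: all ones are still returned, but nothing is logged.
    log_error = not (kind == "CONFIG_READ" and status == "UNSUPPORTED")
    return ALL_ONES, log_error
```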

It is expected that in the event of a config retry, the originating processor will reissue the config command after 
a suitable delay as required by the PCI Express specification. 


Data with Bad ECC from CSW: If data with an ECC error arrives from the CSW, the error details will be 
logged in the PmiReqEccErr register (section 13.13.14) and a bit set in the PmiIntr register (section 13.13.2). The 
transaction will be completed regardless of whether the error was of a single bit or double bit nature. 


13.6.1.9 RC Config Register Access 


The Requester unit does NOT support the legacy I/O based configuration mechanism present in some earlier 
personal computer based implementations of PCI root complexes. That is, we don’t support the “PCI Compatible 
Configuration Mechanism” using I/O addresses 0CF8 and 0CFC. All configuration transactions to non-RC devices 
are via the PCI Express Enhanced Configuration Mechanism. The RC config registers can only be accessed via the 
DBI interface. See Section 13.6.3. 


13.6.2 The Completer Unit CMP 


The Completer Unit is responsible for handling incoming requests from downstream PCI Express devices. The 
primary goal in the design of the CMP is to maximize the available bandwidth. We are not necessarily aiming 
for low latency; we’ll trade latency for more bandwidth whenever we get the chance. The PMI must support an 
aggregate bandwidth of 2GB/s in each direction to keep the link fully busy. 

The CMP handles three transaction types: Memory Write, Memory Read and Message Signalled Interrupt 
operations. Each is first handled in the ICLK domain where the incoming completion or request packet is 
disassembled and digested. The digested form is then sent to a component in the CCLK domain where it is converted 
into a command or sequence of commands on the CSW. Data and header information for read requests are sent 
back into the ICLK domain to be sent along to the RC. 


13.6.2.1 Memory Write Operation 


When a downstream device on the PCI Express bus writes a block of memory, the data item may range in size 
from a single byte up to a 512B block. (We are capping the size to 512B within the RC). The data may or may not 
be aligned to a 64 byte boundary. Figure 13.3 shows the major blocks that participate in serving memory write 
operations. Note that PCI Express MemWrites are posted operations, so that no response is required on the part 
of the application layer. 

Memory write operations are first fielded by the SYC, which is shared between the memory write and memory 
read logic. The payload is written into a data FIFO. The data is aligned to 128 bit boundaries, as found on the 
CSW, before it is written. The SYC also writes the byte masks, the start address, and data block length into the 
write command FIFO. When either the data or command FIFOs are full the SYC will assert a flow control halt 
signal back to the RC to stall the incoming request bus. All of this is done in the ICLK domain. 

The CCW pulls the header/data from the FIFOs. The domain crossing from the ICLK to the CCLK is handled 
by this action. In the case of data blocks that are correctly aligned, the CCW will initiate a BWT operation for 
each 64 byte block in the incoming payload. It is important that we keep the data writes in order. For this reason 
and in order to prevent deadlock conditions, the CCW will not send out the command for a BWT to block X+1 
until it has seen the BWTGO response for the BWT operation on block X. This may limit a single PCI device to 
less than the 4GB/s available bandwidth on the CSW data bus. 

For blocks that are not naturally aligned or are less than 64B, the CCW must perform a write merge. The CCW 
will launch a RDEX operation for the initial block of data, a WINV to return the merged data, BWT operations 
as required for intermediate data and a final RDEX/WINV as required at the end. Each of these steps is handled 
serially and therefore only one 64B write merge buffer is required. The performance of transactions requiring merges 
will be much less than aligned transfers. Write requests from the PCI must be allowed to pass read requests from 
the PCI to forestall deadlock conditions. 
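
The choice between block writes and write merges can be sketched as a small planner. This is an illustrative Python model under stated assumptions: the function name is hypothetical, the operations are plain labels, and the BWTGO handshake between consecutive BWTs is not modelled.

```python
BLOCK = 64  # bytes per CSW block

def plan_inbound_write(addr: int, length: int) -> list[tuple[str, int]]:
    """List the CSW operations, in order, for an inbound memory write.

    A 64 byte block that is fully covered gets a BWT; a partially covered
    block needs a merge: RDEX to fetch it, then WINV to return the merged data.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    ops = []
    end = addr + length
    blk = addr - addr % BLOCK              # start of the first block touched
    while blk < end:
        fully_covered = addr <= blk and blk + BLOCK <= end
        if fully_covered:
            ops.append(("BWT", blk))
        else:
            ops.append(("RDEX", blk))
            ops.append(("WINV", blk))
        blk += BLOCK
    return ops
```

For example, a 0x70 byte write starting at 0x1010 merges into the block at 0x1000 and then issues one BWT for the block at 0x1040.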

The posted data buffer in the RC will be ECC protected. In the event of an uncorrectable error, status registers 
will record the syndrome and address associated with the error and a slow interrupt will be generated if enabled. 
The write will otherwise proceed as if no error had occurred. If RDEX merge data has an uncorrectable error, the address 
and syndrome associated with the error will be recorded and a slow interrupt will be generated if enabled. Control 
registers allow the purposeful corruption of the data coming from the RC posted data buffer and written to the 
write data FIFO in the SYC. 


13.6.2.2 Memory Read Operation 


Memory read operations are non-posted transactions, so a completion is required. The memory read logic is 
shown in Figure 13.4. 

Incoming read requests arrive at the SYC via the RTRGT1 port — the same port that carries write requests and 
MSI delivery packets. The SYC receives the incoming read requests and places them into a read request FIFO. 
This is done in the ICLK domain. 

The request is pulled from the FIFO by the CCR, thereby effecting the clock domain crossing to the CCLK. 
A request can not be serviced until all write requests that preceded it out of the RC have been completed by 
the CCW. The request is parsed into one or more BRD operations. If the request begins or ends at an unaligned 
address, the unneeded data from the first and last BRDs will be discarded prior to being presented to the SYC and 
written into the Completion Data FIFO. This weeding is done on 128 bit quanta. 

Up to three BRDs can be in flight at any one time. The data associated with the BRDs need not come back 
from the CSW in the order they were requested, but they must be presented to the SYC in order. Three buffers 
within the CCR first accept the data from the CSW as it arrives. A separate state machine reads the data from 
these buffers and sends it to the SYC in the needed order. The “weeding” mentioned above is done at this point. 
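
The reordering performed by the three buffers can be modelled as below. This is a behavioural Python sketch, not the state machine itself: the function name and the use of sequence numbers to tag BRDs are assumptions made for illustration.

```python
def deliver_in_order(arrivals, slots: int = 3) -> list:
    """Reorder BRD data that returns from the CSW out of order.

    arrivals: iterable of (sequence_number, data) in CSW arrival order,
              where sequence numbers follow the order the BRDs were issued.
    Returns the data in the order it would be presented to the SYC.
    """
    buffers = {}      # models the buffers inside the CCR
    next_seq = 0      # next BRD whose data the SYC expects
    presented = []
    for seq, data in arrivals:
        if not next_seq <= seq < next_seq + slots:
            raise ValueError("more BRDs in flight than there are buffers")
        buffers[seq] = data
        while next_seq in buffers:         # drain everything now in order
            presented.append(buffers.pop(next_seq))
            next_seq += 1
    return presented
```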

At the time the request service begins, the request information is also written into the Completion header FIFO 
in the SYC. A state machine in the SYC services each request in turn, generating the appropriate PCI transactions. 
Servicing of a completion header begins by determining if a split completion is required, what the data alignment is 
and how much data is required in quanta of 128 bits. While the PCI Express requests may ask for up to 4K Bytes 
in a single transaction, we limit our completions to 512B. The actual size is set by the Max_Payload_Size register 
in the RC. When the correct amount of data is present in the Completion data FIFO, a completion is sent to the 
RC via the XALI1 interface. 

Data coming from the CSW is ECC protected. The data with ECC is forwarded to the Completion Data FIFO 
without being checked. When read out of this FIFO, the ECC is checked and a “bad EOT” signalled to the RC 
in the event of an uncorrectable error. Status registers will record the syndrome and address associated with the 
error and a slow interrupt will be generated if enabled. 


13.6.2.3 Message Signalled Interrupts 


MSI interrupts are implemented by PCI Express devices as memory write transactions to an address that was 
initially written by the configuration software. That is, each device capable of initiating an MSI interrupt has a 
message address register to which it will write to signal the interrupt. Each such device also has a 16 bit message 
data register that will be written to the message address when the interrupt is signalled. 

That fits rather nicely in with the interrupt scheme implemented in the ICE9 processor segment. Interrupts 
are delivered to a processor via the CSW INTR transaction that writes a 16 bit value to an interrupt cause FIFO. 
The low three bits (the intsel or interrupt select field) of the interrupt designate which of the six interrupts is to be 
signalled. The upper 13 bits (the reason field) contain any information the device requires to identify the reason 
for the interrupt. 

So, the MSI scheme is rather simple. When the CCW detects a memory write to an address range specified 
by the PmiMsiAddr register (section 13.13.19), it generates a CSW INTR command to the processor (address bus 
stop) identified by address bits 5:2. The “address” associated with this command is the low 12 bits of the write 
data payload. 
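
The field layout described above can be illustrated with two small helpers. This is a hedged sketch: the function names are hypothetical, and the mapping of the MSI write payload onto the INTR value is deliberately left out, since the text describes it only in terms of the low 12 payload bits.

```python
def msi_target_processor(write_addr: int) -> int:
    """Address bus stop selected by bits 5:2 of the MSI write address."""
    return (write_addr >> 2) & 0xF

def split_intr_value(value: int) -> tuple[int, int]:
    """Split a 16 bit INTR value into (intsel, reason).

    intsel is the low three bits; reason is the upper 13 bits.
    """
    value &= 0xFFFF
    return value & 0x7, value >> 3
```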

The MSI INTR command always uses TID PCIWTS. 


13.6.3 The Control/Status Widget CSI 


The control/status widget implements the interface between the CSW and the DBI/SII/VMI ports on the root 
complex, as well as supporting access to the Serial Configuration Bus controller, the PCI Express Phy, internal 
PMI configuration registers, and the 16550 UART. The CSI is shown in Figure 13.5. 

Commands from the CSW are pushed into a FIFO. The FIFO is six entries deep, one for each of the six 
processors. Commands are taken off the FIFO one at a time and fully processed before the next command is 
attended to. The CSI processes only RDIO and WTIO commands from the CSW command/address bus. In the 
case of a RDIO, it will read the appropriate data register from the target and return the data. In the case of a 
WTIO, the CSI will initiate a RDIO command to the processor that issued the WTIO so as to acquire the write 
data. When the RDIO completes, the write data will be written into the target register. 

The CSI is comprised of the WBI, DBI, CIF, CRI and CIN sub-blocks. The WBI is the wishbone bus interface 
to the UART and I2C. The DBI accesses the interface of the same name on the RC. The CIF contains the CSW 
command FIFO and handles the interfacing to the CSW. The CRI handles the interface to the Phy. The CIN 
contains the PMI internal status and control registers as well as allowing access to the RC SII and VMI signals. It 
also handles the slow interrupt generation. 


13.6.3.1 The CSW Interface CIF 


The CIF executes the CSW protocol. It accepts commands from the CSW and places them into a FIFO. The 
commands are pulled from the FIFO and parsed to determine if a RDIO back to the originating processor is required 
and also to determine which sub-function within the CSI should receive the command/data. The appropriate sub- 
function is sent the request and an ack awaited before moving directly onto the next command, in the event of a 
write, or sending the data back to the CSW and awaiting a CSW grant, in the event of a read, before moving onto 
the next command. 


13.6.3.2 The Wishbone Interface WBI 


The WBI receives requests from the CIF and translates them into the wishbone protocol. It awaits an ack from 
a wishbone device (the UART or I2C) and signals the CIF that the request has been completed. In the event that 
an ack is not received in a timely fashion, a completion is sent back to the CIF anyway. If the request was a read, 
all ones data is returned with the completion. The number of clock ticks until a timeout occurs is under software 
control via the PmiWbToVal register (section 13.13.20). 


13.6.3.3 The RC Register Interface DBI 


The DBI receives requests from the CIF and translates them into the data bus interface protocol as defined 
in the Synopsys Root Complex documentation. All configuration header space and extended configuration header 
space registers that pertain to the RC are only accessible via this interface. The data bus interface includes a 
request/ack handshake. When the ack occurs, a completion is signalled back to the CIF, with or without data. 
This interface requires a clock crossing from the CCLK to the ICLK domains for a request and from the ICLK to 
the CCLK domains for the ack/read data. 


13.6.3.4 The Phy Interface CRI 


The CRI receives requests from the CIF and translates them into the interface protocol as defined in the Phy 
Core documentation. The interface includes a request/ack handshake. When the final ack occurs, a completion is 
signalled back to the CIF, with or without data. This is an asynchronous interface; only the ack signal coming 
back from the PHY needs to be synchronized to the CCLK. 


13.6.3.5 The PMI Register Block CIN 


The CIN performs a number of functions: 

The CIN enacts the Vendor Message Interface (VMI) handshake. This is used to cause the RC to send a 
downstream vendor message. It is initiated by writing the appropriate data to the PmiVmReqDat (section 13.13.21), 
PmiReqHadr (section 13.13.22) and PmiVmReqCmd (section 13.13.23) registers. When the ack returns from RC, a 
completion is signalled to the CIF so that it can move onto the next command. 

The CIN also aggregates all the SII (System Information Interface) signals into a number of registers. They 
are enumerated and described in the PMI Control and Status Register section (section 13.13). 

All of the various error and status conditions that could cause a slow interrupt are aggregated into the PmiIntr 
register (section 13.13.2) within the CIN. The interrupt enable register PmiIntrEn (section 13.13.3) determines 
which of the potential sources can cause a slow interrupt. Some of the sources can be cleared directly by writing 
a one to the appropriate bit in the PmiIntr register. Others can only be cleared by sifting through the causality 
hierarchy to find the origin. 

In the event that the RC signals that a legacy interrupt has been asserted, this assertion will not be readable 
via the CSW until all write commands that preceded the interrupt message have been completed by the CCW. 


13.6.3.6 CSI Exception Handling 


Error conditions can arise in a number of places in the CSI: 


Data with Bad ECC from CSW: If data with an ECC error arrives from the CSW, the error details will be 
logged in the PmiCsiEccErr register (section 13.13.10) and a bit set in the PmiIntr register (section 13.13.2). The 
transaction will be completed regardless of whether the error was of a single bit or double bit nature. 


Out of range address from CSW: If a request arrives whose address does not pertain to any of the subfunctions 
within the CSI, the error details will be logged in the PmiCsiAddrErr register (section 13.13.11) and a bit set in the 
PmiIntr register (section 13.13.2). The CSW protocol will be completed, meaning that read data with all ones will 
be returned for a read and a RDIO will be issued for a write with the subsequently returned data being discarded. 


64 bit DBI access request: The DBI port to the RC has a 32 bit data path. If an access of more than 32 bits 
is requested, the error details will be logged in the PmiCsiDbiErr register (section 13.13.12) and a bit set in the 
PmiIntr register (section 13.13.2). The CSW protocol will be completed, meaning that read data with all ones will 
be returned for a read and a RDIO will be issued for a write with the subsequently returned data being discarded. 


Wishbone Timeout: If an access to a wishbone component (the UART or I2C) times out, the error details will 
be logged in the PmiCsiWtoErr register (section 13.13.13) and a bit set in the PmiIntr register (section 13.13.2). 
The CSW protocol will be completed, meaning that read data with all ones will be returned for a read. (The 
protocol for a write had already completed prior to the transaction to the wishbone being started and hence before 
the timeout.) 


13.6.4 The Command/Address Multiplexer CMX 


The Command/Address multiplexer takes command inputs from each of the command processing units (REQ, 
CMP, CSI, and CIN). Requests from the CMP are given priority over the other three, which are selected on an LRU 
basis. The command processing units can only present requests one at a time and move onto a new request only 
when given a grant. 
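
The arbitration policy (fixed priority for the CMP, least recently used among the rest) can be sketched as follows. This is a behavioural Python model only: the class name and the initial LRU order are assumptions, and the grant timing on the CSW is not modelled.

```python
class CmxArbiter:
    """Grant one requester at a time: CMP first, the others by LRU."""

    def __init__(self):
        # Least recently granted unit first. The reset order is an assumption.
        self.lru = ["REQ", "CSI", "CIN"]

    def grant(self, requesting: set):
        if "CMP" in requesting:
            return "CMP"                   # CMP always wins
        for unit in self.lru:
            if unit in requesting:
                self.lru.remove(unit)      # move the winner to the
                self.lru.append(unit)      # most-recently-used end
                return unit
        return None                        # nobody is requesting
```

With REQ and CSI both requesting continuously, grants alternate between them, while any CMP request is served ahead of both.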

The CMX also buffers the Command/Address from the CSW headed to the CMP, REQ or CSI. It parses the 
address to determine the target of the incoming command. There is no throttling mechanism for incoming requests 
from the CSW, so they are parsed and sent to FIFOs within each of the target units. 


13.6.5 The Data Multiplexer DMX 


The data multiplexer accepts inputs from each of the data sourcing units (REQ, CMP, and CSI). Requests from 
the CMP are given priority over the other two, which are selected on an LRU basis. The data sourcing units can 
only present requests one at a time and move onto a new request only when given a grant. In the case of requests 
from the REQ and CSI, the data can only be up to 64 bits in length and hence is accepted at the time of the 
grant. Data from the CMP is 64B in length. At the time of a request from the CMP, the DMX will immediately 
issue a grant as long as it is not busy servicing another request. The data from the CMP is then streamed into the 
DMX in preparation for streaming onto the CSW data lines as soon as the CSW grant is received. This puts the 
outbound data right next to the CSW and allows the buffer within the CMP to be freed for the assembly of the 
next transaction. The DMX generates ECC for all data headed to the CSW. The PmiFrcEccErr register (section 
13.13.9) allows the purposeful corruption of the data headed out to the CSW. The DMX also buffers the data 
transactions from the CSW. 


13.7 Valid CSW Operations 


The PMI both accepts commands/data from the CSW and sends commands/data to the CSW. The following 
enumerates the sequence of events that are permissible in interacting with the PMI. The nomenclature used is that 
“PMI:BWT(COH)” means that a BWT command was sent by the PMI to the COH via the CSW. 

CSW:RDIO -> PMI:DATA 
CSW:WTIO -> PMI:RDIO -> CSW:DATA 
PMI:BWT(COH) -> CSW:BWTGO(COH) -> PMI:DATA(COH) 
PMI:BWT(COH) -> CSW:BWTGO(PX) -> PMI:DATA(PX) 
PMI:BWT(COH) -> CSW:BWTNOHIT -> PMI:DATA(COH) 
PMI:BWT(COH) -> CSW:PRBINV -> PMI:DATA(COH) 
PMI:RDEX(COH) -> CSW:DATA(COH) 
PMI:RDEX(COH) -> CSW:DATA(PX) -> PMI:PRBDONE(COH) 
PMI:RDEX(COH) -> CSW:PRBNOHIT -> PMI:RDEXR(COH) -> CSW:DATA(COH) 
PMI:WINV(COH) -> PMI:DATA(COH) 
PMI:BRD(COH) -> CSW:DATA(COH) 
PMI:BRD(COH) -> CSW:DATA(PX) -> PMI:PRBDONE(COH) 
PMI:BRD(COH) -> CSW:PRBNOHIT -> PMI:BRDR(COH) -> CSW:DATA(COH) 
CSW:PRBWIN(PX) -> PMI:PRBNOHIT 
CSW:PRBBWT(PX) -> PMI:BWTNOHIT 
CSW:PRBBRD(PX) -> PMI:PRBNOHIT 
CSW:PRBSHR(PX) -> PMI:PRBNOHIT 
PMI:INTR -> CSW:DONE 


13.8 Valid PCI Operations 


Coming from an endpoint, the RC and PMI will only accept completions, MemWrites, MemReads, vendor 
messages and MSIs (which look just like MemWrites). A core within the ICE9 can initiate a MemWrite, MemRead, 
IO Write, IO Read, Config Write or Config Read transaction headed to a downstream device. It can also cause 
certain status messages to be sent, as specified in the register definitions below and in the RC specification. 

All Config and IO transactions use 32 bit addressing and data. A MemWrite or MemRead can use 32 or 64 bit 
addressing and up to 64 bits of data. A Mem command to an address with Addr[63:32] = 0x8 will result in a 3 DW 
header (32 bit address) being sent. Mem commands with those same bits set to 0x9, 0xA or 0xB will cause a 4 
DW header (64 bit address) to be sent. A practical consequence of this is that 32 bit endpoints will use up some of 
the bottom 4GB of the main memory allocation. 
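
The header size rule for Mem commands reduces to a check on the upper address bits. A minimal Python sketch, with a hypothetical function name:

```python
def mem_header_dwords(addr: int) -> int:
    """Return the TLP header size in DW for a Mem command address."""
    upper = addr >> 32                     # Addr[63:32]
    if upper == 0x8:
        return 3                           # 32 bit address on the PCI side
    if upper in (0x9, 0xA, 0xB):
        return 4                           # 64 bit address on the PCI side
    raise ValueError("address is outside the values listed for Mem commands")
```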


13.9 Ordering Rules 


Before stating the ordering rules, it would be good to define a couple terms. An “inbound” transaction is one 
originating at an endpoint and heading to the ICE9. An “outbound” transaction is one originating within the ICE9 
and heading to an endpoint. 

Inbound transactions can only be posted operations (memory writes, message signalled interrupts (MSIs) or 
vendor messages), memory reads or completions. Posted operations are handled in order; a posted operation can 
not pass another posted operation. The exception to this, in the case of the ICE9, is that vendor messages are 
handled at presentation from the PRC to the PMI, whereas the other two types of posted operations are stuck 
in a queue and handled when they get to the top of the queue. Posted operations can pass memory reads and 
completions. Memory reads and completions are also handled in order of presentation, but neither is handled before 
any posted operation that preceded it. Completions can, however, pass memory reads. 

Outbound transactions can be posted (memory writes or vendor messages), non-posted (config reads/writes, IO 
reads/writes or memory reads) or completions. Similar rules as above apply to the outbound transactions. Posted 
operations occur in order except that vendor messages can pass memory writes. Non-posted operations occur in 
order, as do completions. Completions can pass non-posted operations, but can not pass posted operations. 
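
The inbound rules can be restated as a predicate that answers whether a later transaction may be handled ahead of an earlier one. This is one reading of the paragraph above, written as a Python sketch: the function name and transaction labels are hypothetical, and vendor messages are treated as passing everything because they are handled at presentation rather than queued.

```python
POSTED = {"MEMWR", "MSI", "VENDOR_MSG"}

def inbound_may_pass(later: str, earlier: str) -> bool:
    """May `later` be handled before `earlier` on the inbound path?

    Labels: "MEMWR", "MSI", "VENDOR_MSG", "MEMRD", "COMPLETION".
    """
    if later == "VENDOR_MSG":
        return True                        # handled at presentation, never queued
    if later in POSTED:
        return earlier not in POSTED       # passes reads and completions only
    if later == "COMPLETION":
        return earlier == "MEMRD"          # completions may pass memory reads
    return False                           # memory reads pass nothing
```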

For the purposes of the ICE9, the “timestamp” of an operation is not when it first comes across the CSW, but 
when it gets to the top of the REQ queue and, if needed, the associated data has been retrieved from the originating 
processor. 


13.10 Auxiliary PCI Signals 


There are a number of signals needed to control the PCI Express module or card. 


13.10.1 PERST# output 


PCIe express module or card fundamental reset. Active low on the PCB. Resets the PCIe card or express module 
attached to the ICE9 when asserted. The logic is PERST# = ~(ResetCard | MPWRGD#). The ResetCard signal 
is bit 11 in the Core Control Register (section 13.13.1). Drives PERST# on cards and MRST# on express modules. 
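
As a quick check of the polarity, the reset equation can be evaluated on pin levels. This is a trivial Python model with a hypothetical function name; 0 and 1 are the electrical levels of the active-low pins.

```python
def perst_n(reset_card: int, mpwrgd_n: int) -> int:
    """Level driven on PERST#: ~(ResetCard | MPWRGD#).

    Returns 0 (reset asserted) when ResetCard is set or when power good
    is not asserted (MPWRGD# high); returns 1 (reset released) otherwise.
    """
    return 0 if (reset_card or mpwrgd_n) else 1
```

Reset is released only when ResetCard is clear and MPWRGD# is driven low.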


13.10.2 MPWRGD# input 


PCIe express module power good. Active low on PCB. See PERST# for usage. On CPU modules, which 
support PCIe express modules, MPWRGD# is pulled up on the PCB. Therefore, MPWRGD# is deasserted by 
default; an express module must drive it low to assert it, and PERST# cannot be deasserted until it does so. On 
development boards, which support PCIe cards, MPWRGD# is pulled down on the PCB; therefore, it is always 
asserted. This is necessary since PCIe cards don’t support MPWRGD# and PERST# could never be deasserted 
otherwise. 


13.10.3 PWRFLT# input 


PCIe express module power fault. Active low on PCB. When asserted a 1 should appear in Slot Status Register[1] 
(Power Fault Detected). This is probably meant to be a sticky bit since PWRFLT can be transient. On CPU 
modules and development boards PWRFLT# is pulled up on the PCB. Therefore, it is deasserted by default; an 
express module must drive it low to assert it. PCIe cards don’t support this signal, so it is never asserted on 
development boards. 


13.10.4 PWREN# output 


PCIe express module power enable. Active low on PCB. Driven by Slot Control Register[10] (Power Controller 
Control). 


13.10.5 PRSNT# input 


PCIe express module or card present. Active low on PCB. When asserted a 1 should appear in Slot Status 
Register[6] (Presence Detect State), otherwise a 0. Presence Detected Changed presumably has to get set when 
PRSNT changes state, and is a sticky bit. 


13.10.6 ATNLED output 


PCIe express module attention LED. A state machine controls this output, which can be on, off, or blinking. 
The output behavior is defined by Slot Control Register[7:6]. If blinking, the on or off time of the 50% duty cycle 
signal is defined by the LED Blink Rate Register (section 13.13.4). This register gives the high or low time in clock 
cycles; the frequency should be 1-2Hz. 
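
Because the register holds the high (or low) time of a 50% duty cycle signal, its value is half of one blink period expressed in clock cycles. The Python sketch below shows the arithmetic; the function name is hypothetical, and the clock that drives the blink counter is not stated in this section, so its frequency is a parameter.

```python
def blink_half_period_cycles(clock_hz: float, blink_hz: float) -> int:
    """Clock cycles for the on (or off) time of a 50% duty cycle blink."""
    if not 1.0 <= blink_hz <= 2.0:
        raise ValueError("blink frequency should be 1-2Hz")
    # One period is clock_hz / blink_hz cycles; half is high, half is low.
    return round(clock_hz / (2.0 * blink_hz))
```

For instance, with an assumed 100 MHz counter clock, a 2 Hz blink needs a value of 25,000,000 cycles.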


13.10.7 PWRLED output 


PCIe express module power LED. A state machine controls this output, which can be on, off, or blinking. The 
output behavior is defined by Slot Control Register[9:8]. If blinking, the on or off time of the 50% duty cycle 
signal is defined by the LED Blink Rate Register (section 13.13.4). This register gives the high or low time in clock 
cycles; the frequency should be 1-2Hz. 


13.11 Definitions 


Package: chip_pci_spec 


13.11.1 PCI Type Enumerations 


Enum: PciType 


13.11.2 PCI Format Enumerations 


Enum: PciFmt 

Constant | Mnemonic | Definition 
2'h0 | NODAT3DWH | 3 DW header without data 
2'h1 | NODAT4DWH | 4 DW header without data 
2'h2 | DAT3DWH | 3 DW header with data 
2'h3 | DAT4DWH | 4 DW header with data 


13.11.3 PCI Completion Status Enumerations 


Enum: PciCplStat 

Successful Completion 
Unsupported Request 
Configuration Request Retry Status 
Completer Abort 


13.11.4 PCI Completion State Machine State Enumerations 


Enum: PciCmpSm 

Constant | Mnemonic | Definition 
2'h0 | IDLE | 
2'h1 | WAIT | Wait for data to accumulate 
2'h2 | STREAM | Stream data out to RC 


13.11.5 PCI Block Write State Machine State Enumerations 


Enum: PciBwtSm 

Constant | Mnemonic | Definition 
4'h0 | | 
4'h1 | | 
4'h2 | | 
4'h3 | | 
4'h4 | | 
4'h5 | | 
4'h6 | | 
4'h7 | | 
4'h8 | | 
4'h9 | INTRDONE | Waiting on the DONE in response to the INTR 


13.11.6 PCI Block Read State Machine State Enumerations 


Enum: PciBrdSm 

Mnemonic | Definition 
IDLE | 
BRDCMD | Sending BRD command 
BRDRCMD | Sending a BRDR command 
PRBDONCMD | Sending PRBDONE command 


13.11.7 PMI Request Result Enumerations 


Enum: PmiReqRes 

NODAT 
DAT32 
DATA 
UNSUPPORTED 
POISONED 
BADECRC 
BADLENGTH 
DLLPABORT 
TLPABORT 
TIMEOUT 
RETRY 
ABORT 


13.11.8 PMI Events 


The following events are trackable by SCB statistical event counting. 


Enum 


PmiScbEvent 


Attributes 


-descfunc 


8'h05 | MEMW32_OUT | Number of outbound PCI Memory Writes with 32 bit data. 
8'h06 | MEMW64_OUT | Number of outbound PCI Memory Writes with 64 bit data. 
8'h07 | MEMR32_OUT | Number of outbound PCI Memory Reads with 32 bit data. 
8'h08 | MEMR64_OUT | Number of outbound PCI Memory Reads with 64 bit data. 
8'h10 | MEMWA64_IN | Number of inbound aligned memory writes with data of 64B or less. 
8'h11 | MEMWA128_IN | Number of inbound aligned memory writes with data of 128B or less. 
8'h12 | MEMWA256_IN | Number of inbound aligned memory writes with data of 256B or less. 
8'h13 | MEMWA512_IN | Number of inbound aligned memory writes with data of 512B or less. 
8'h14 | MEMWU64_IN | Number of inbound unaligned memory writes with data of 64B or less. 
8'h15 | MEMWU128_IN | Number of inbound unaligned memory writes with data of 128B or less. 
8'h16 | MEMWU256_IN | Number of inbound unaligned memory writes with data of 256B or less. 
8'h17 | MEMWU512_IN | Number of inbound unaligned memory writes with data of 512B or less. 
8'h18-8'h1F | | Reserved 
8'h20 | MEMRA64_IN | Number of inbound aligned memory reads with data of 64B or less. 
8'h21 | MEMRA128_IN | Number of inbound aligned memory reads with data of 128B or less. 
8'h22 | MEMRA256_IN | Number of inbound aligned memory reads with data of 256B or less. 
8'h23 | MEMRA512_IN | Number of inbound aligned memory reads with data of 512B or less. 
8'h24 | MEMRU64_IN | Number of inbound unaligned memory reads with data of 64B or less. 
8'h25 | MEMRU128_IN | Number of inbound unaligned memory reads with data of 128B or less. 
8'h26 | MEMRU256_IN | Number of inbound unaligned memory reads with data of 256B or less. 
8'h27 | MEMRU512_IN | Number of inbound unaligned memory reads with data of 512B or less. 
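
The inbound event codes follow a regular pattern: a base of 8'h10 for writes or 8'h20 for reads, plus 4 for unaligned accesses, plus a size bucket. The sketch below shows one way analysis software might map a transaction to its counter code. It assumes that each "N bytes or less" bucket excludes the smaller buckets, which the table does not state outright, and the function name is hypothetical.

```python
SIZE_LIMITS = (64, 128, 256, 512)  # bucket upper bounds in bytes


def inbound_mem_event(is_read: bool, aligned: bool, nbytes: int) -> int:
    """Return the PmiScbEvent code that would count this inbound access."""
    if not 0 < nbytes <= SIZE_LIMITS[-1]:
        raise ValueError("size must be between 1 and 512 bytes")
    # Pick the first bucket whose limit covers the transfer size.
    bucket = next(i for i, limit in enumerate(SIZE_LIMITS) if nbytes <= limit)
    base = 0x20 if is_read else 0x10
    return base + (0 if aligned else 4) + bucket
```

For example, an unaligned 65 byte write lands in the 128B bucket and yields 8'h15 (MEMWU128_IN).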


13.12 PCI Express Root Complex Registers 


All of the registers in this section are within the Synopsys Root Complex. The details of these registers were 
taken from the document supplied by Synopsys. 


13.12.1 Device/Vendor ID Register 
Description 
Register 

R_PcieId 


Attributes 


-kernel 


Address 
0xE_9800_0000 


Definitions 


31:16 | DeviceID | RW | 1 | | Device ID. 
15:0 | VendorID | RW | 0x19B2 | | Vendor ID. Assigned by PCI-SIG. 


13.12.2 Command and Status Register 
Description 
Register 

R_PcieCmdStat 


Attributes 


-kernel -writeonemixed 

Address 
0xE_9800_0004 


Definitions 


31 | DetParErr | RW1C | | | 1 = forwarding an outbound Poisoned TLP (bit is set regardless of ParErrEn). 
30 | SigSysErr | RW1C | | | Set when RC generates ERR_(NON)FATAL message and SerrEn = 1 in the Command Register (bit 40 in this register). 
29 | RcvdMstrAbrt | RW1C | | | Set when primary side of RC receives UR Completion Status. 
28 | RcvdTgtAbrt | RW1C | | | Set when primary side of RC receives CA Completion Status. 
27 | SigTgtAbrt | RW1C | 0 | | Set when RC sends CA Completion Status for Request. 
26:25 | | | | | Reserved 
24 | MstrDatParErr | RW1C | | | 1 = received (and forwarding) an inbound Poisoned TLP (and Parity Error Response bit - bit 38 - in Command portion of this Register is set). 
23:21 | | | | | Reserved 
20 | CapList | R | 1 | | Indicates presence of extended Capabilities List. 
19 | IntStatus | R | 0 | | Indicates pending INTx Message. Irrelevant for RC. 
10 | | RW | 0 | | Disables INTx interrupts from being sent. 
8 | SerrEn | RW | | | (Non)fatal error messages (from Endpoint) reported if this bit is 1 (or if another bit in Device Control Register is set). This reporting takes the form of updating Root Control register and/or Root Error Command register, and logging in the Root Error Status register and Error Source ID register. 
7 | | | | | Reserved 
6 | ParErrResp | RW | 0 | | Parity Error Response 
5:3 | | | | | Reserved 
2 | BusMstrEn | RW | | | Bus Master Enable. 0 = treat incoming MEM/IO requests as Unsupported Request. 
1 | MemDecEn | RW | 0 | | 1 = Allow MEM accesses from Endpoint 
0 | IoDecEn | RW | 0 | | 1 = Allow IO accesses from Endpoint 
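
Because this register mixes RW1C status bits with ordinary RW control bits, a plain read-modify-write would write back every status bit that happens to be set and clear it unintentionally. The sketch below builds a write value that clears only the requested status bits. The mask is assembled from the RW1C rows listed for this register; the function name is hypothetical and the code models the value computation only, not the actual register access.

```python
# RW1C status bits of the Command and Status Register: 31, 30, 29, 28, 27, 24.
RW1C_MASK = (1 << 31) | (1 << 30) | (1 << 29) | (1 << 28) | (1 << 27) | (1 << 24)


def clear_status_value(current: int, bits_to_clear: int) -> int:
    """Return the value to write so that only bits_to_clear are cleared.

    Control bits are written back unchanged; all other RW1C bits are
    written as 0 so they keep their current state.
    """
    if bits_to_clear & ~RW1C_MASK:
        raise ValueError("only RW1C status bits can be cleared this way")
    preserved_controls = current & ~RW1C_MASK & 0xFFFFFFFF
    return preserved_controls | bits_to_clear
```

For instance, if DetParErr (bit 31) and MstrDatParErr (bit 24) are both set, requesting a clear of bit 31 produces a value with bit 24 written as 0, so MstrDatParErr stays latched.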
13.12.3 RevID, Class Code Register 
Description 
Register 
R_PcieRevId 
Address 


0xE_9800_0008 


Definitions 

31:8 | ClassCode | RW | 0x060400 | | Class Code 
7:0 | RevId | RW | 1 | | Revision ID 


13.12.4 Cache Line Size, BIST etc register 


Description 


Unsure what the Revision ID and Class Code is for this chip. The Cache Line Size is irrelevant for PCI Express 
functionality. Master Latency Timer Register is hardwired to 0. Need to find the details of Header Type and BIST 
fields. 

Register 

R_PcieCcMisc 


Address 
0xE_9800_000C 


Definitions 


31:24 | Bist | R | 0 | | Not supported by RC 
23 | | RW | 0 | | Multi Function Device 
22:16 | HdTyp | R | 1 | | Config Header Format 
15:8 | MstLatTim | R | 0 | | Hardwired to 0 
7:0 | CacLinSiz | RW | 0 | | System Cache Line Size. Irrelevant for us. 


13.12.5 Base Address Register 0 
Description 


The Base Address Registers specify the windows for Memory and IO access from the endpoint. For our Root 
Port, we have no need for this and so will keep it at 0. 


Register 
R_PcieBar0 


Address 
0xE_9800_0010 


Definitions 


31:0 | BaseAddr | R | 0x4 | | [31:4] are 0. [3:0] indicate non-prefetchable (0), 64 bit (10), memory (0). 
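
The reset value 0x4 can be checked against the field description with a short decode. This sketch follows the standard PCI Base Address Register layout for bits [3:0]; the function name and dictionary keys are illustrative only.

```python
def decode_bar_flags(bar: int) -> dict:
    """Decode the low four bits of a memory Base Address Register."""
    return {
        "io_space": bool(bar & 0x1),             # bit 0: 0 = memory, 1 = IO
        "is_64bit": ((bar >> 1) & 0x3) == 0x2,   # bits 2:1: 10 = 64 bit
        "prefetchable": bool(bar & 0x8),         # bit 3: 0 = non-prefetchable
    }


flags = decode_bar_flags(0x4)
```

Decoding 0x4 gives a non-prefetchable, 64 bit memory BAR, which matches the description in the table.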


13.12.6 Base Address Register 1 


Description 


The Base Address Registers specify the windows for Memory and IO access from the endpoint. For our Root 
Port, we have no need for this and so will keep it at 0. 

Register 
R_PcieBar1 


Address 
0xE_9800_0014 


Definitions 


31:0 | BaseAddr | R | 0 | | Hardwired 


13.12.7 Bus Number Register 
Description 


The Primary Bus number for a Root Complex is 0. The Secondary Bus number is 1. The Subordinate Bus 
number can be any number from 1 (indicating an Endpoint connection) to a number greater than 1 (indicating a 
Switch connection). 


Register 
R_PcieBusNum 


Address 
0xE_9800_0018 


Definitions 


31:24 | SecLatTim | R | 0 | | Hardwired to 0 
23:16 | SubBusNum | RW | 0 | | Subordinate Bus Number 
15:8 | SecBusNum | RW | 0 | | Secondary Bus Number 
7:0 | PrBusNum | RW | 0 | | Primary Bus Number 


13.12.8 I/O Base/Limit, and Secondary Status Register 
Description 
Register 

R_PcieSecStat 


Attributes 


-writeonemixed 


Address 
0xE_9800_001C 


Definitions 


31 | DetParErr | RW1C | | | 1 = received Poisoned TLP in inbound direction (regardless of ParErrResp bit value in Bridge Control register). 
30 | RcvSysErr | RW1C | | | 1 = Received incoming ERR_(NON)FATAL message (Errata doc on Page 104 says that this bit is not dependent on SERR Enable bit in Bridge Control Register). 
29 | | RW1C | | | Set when RC receives UR Completion Status for outbound Request. 
28 | RcvTgtAbrt | RW1C | | | Set when RC receives CA Completion Status for outbound Request. 
27 | SigTgtAbrt | | | | Set when RC sends CA Completion Status for inbound Request. 
 | | | | | Reserved 
 | | | | | 1 = sent an outbound Poisoned TLP (and Parity Error response bit - bit 0 - in Bridge Control Register is set). If ParErrResp bit in Bridge Control register is 0, this bit is always 0. 
 | | | | | Reserved 
15:12 | IoLimit74 | | | | IO Limit Register Value (along with implicit zeroes in lower 12 bits, provides end/limit of address space of outbound IO transactions in 64KB address space). 
11:8 | IoLimit30 | | | | 0 = 16-bit IO address decode (64KB space). 1 = 32-bit IO address decode (4GB space). Value in IO Upper Limit Register valid if this value is 1. Values 0x2 through 0xF are reserved. 
7:4 | IoBase74 | | | | IO Base register value (along with implicit zeroes in lower 12 bits, provides start address space of outbound IO transactions in 64KB address space). 
3:0 | IoBase30 | | | | 0 = 16-bit IO address decode (64KB space). 1 = 32-bit IO address decode (4GB space). Value in IO Upper Base Register valid if this value is 1. Values 0x2 through 0xF are reserved. 


13.12.9 Non-Prefetchable Memory Base and Limit Register 


Description 


These registers define the start and end address range for valid outbound Memory transactions. 


Register 
R_PcieMemBase 


Address 
0xE_9800_0020 


Definitions 


31:20 | MemLmt | RW | | | End Address of Memory range. Upper 12 bits of implicit 32-bit range (lower 20 bits are assumed 0xFFFFF). 
19:16 | | | | | Reserved 
15:4 | | | | | Start Address of Memory range. Upper 12 bits of implicit 32-bit range (lower 20 bits are assumed 0). 
3:0 | | R | 0 | | Reserved 
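
The implicit low address bits can be made concrete with a small calculation. This is a sketch only: it assumes the limit field occupies bits 31:20 and the base field bits 15:4, as in a standard PCI bridge Memory Base/Limit register, since the bit columns of the table above are partly illegible. The function name is hypothetical.

```python
def mem_window(reg: int) -> tuple:
    """Return (start, end) of the outbound memory window, inclusive."""
    base_field = (reg >> 4) & 0xFFF     # assumed position: bits 15:4
    limit_field = (reg >> 20) & 0xFFF   # assumed position: bits 31:20
    start = base_field << 20            # lower 20 bits are assumed 0
    end = (limit_field << 20) | 0xFFFFF # lower 20 bits are assumed 0xFFFFF
    return start, end


start, end = mem_window(0xDFF0D000)
```

With the example value, the window runs from 0xD000_0000 to 0xDFFF_FFFF. A register whose limit field is below its base field describes an empty window.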




13.12.10 Prefetchable Memory Base and Limit Register 
Description 


These registers define the start and end address range for valid outbound Memory transactions. 


Register 
R_PciePreMemBase 


Address 
0xE_9800_0024 


Definitions 


31:20 | PMemLmt | RW | 0 | | Prefetchable Memory Limit 
16 | | RWS | 1 | | 64 bit addressing 
15:4 | PMemBas | RW | 0 | | Prefetchable Memory Base 
3:1 | | R | | | Reserved 
0 | | RWS | 1 | | 


13.12.11 Prefetchable Memory Upper Base Register 
Description 


These are the upper bits of prefetchable memory base. 


Register 
R_PciePreBaseUpper 


Address 
0xE_9800_0028 


Definitions 


31:0 | PMemBU | RW | 0 | | Prefetch Memory Base Upper register 


13.12.12 Prefetchable Memory Upper Limit Register 


Description 


These are the upper bits of prefetchable memory limit. 


Register 
R_PciePreLimitUpper 


Address 
0xE_9800_002C 


Definitions 

31:0 | PMemLU | RW | 0 | | Prefetch Memory Limit Upper Register 


13.12.13 I/O Base and Limit Upper Register 
Description 


These registers define the Upper Base and Limit range for outbound IO space if that space is 32-bits wide. 


Register 
R_PcieIOUpperBaseLimit 


Address 
0xE_9800_0030 


Definitions 


31:16 | IoLimitU | RW | | | Upper 16 bits of IO Limit register (only valid if outbound IO space is in 4GB space rather than 64KB space). 
15:0 | IoBaseU | RW | | | Upper 16 bits of IO Base register (only valid if outbound IO space is in 4GB space rather than 64KB space). 


13.12.14 Capability Pointer Register 
Description 
Register 

R_PcieCapabilityPtr 


Address 
0xE_9800_0034 


Definitions 


31:8 | | R | 0 | | Reserved 
7:0 | | | 0x40 | | Capability Pointer. Points to (contains the offset to) register set associated with the next Capability. 


13.12.15 Expansion ROM Register 


Description 


We do not support an Expansion ROM within the Root Complex bridge. 


Register 


R_PcieExpRom 


Address 
0xE_9800_0038 

Definitions 


 | | R | | | Reserved. Expansion ROM not supported. 
 | | R | | | Reserved 
0 | Enable | R | 0 | | Expansion ROM enable 


13.12.16 Bridge Control Register 


Description 
Register 
R_PcieBrgCtrl 


Address 
0xE_9800_003C 


Definitions 


22 | SecBusRst | RW | 0 | | 1 = Triggers Hot Reset on PCI-E link. 
21 | MstrAbort | R | 0 | | Not applicable 
20 | | RW | | | VGA 16 bit decode 
19 | VGAEn | RW | | | 
17 | SerrEn | RW | | | 1 = Allows forwarding of received ERR_{COR, NONFATAL, FATAL} error messages to primary side of Bridge. The SerrEn bit in the Command Register controls reporting of these forwarded messages to the Root Complex. 
16 | ParErrResp | | | | 1 = Enable Master Data Parity Error status bit in both primary and secondary status registers. 
15:8 | | | | | Interrupt Pin register. Irrelevant for Root Complex. 
7:0 | IntLine | RW | 0xff | | Interrupt Line register. Irrelevant for Root Complex. 


13.12.17 PCI Power Management Capabilities Register 


Description 
Register 
R_PciePMCap 


Address 
0xE_9800_0040 


Definitions 


 | | R | 0 | | Reserved 
30:27 | PMESup | RWS | 0xb | | Bits 30, 28, 27 set to 1 for Root Port to indicate in which states it will forward received PME Messages to the Root Complex. 
26 | D2Sup | RW | 0 | | D2 Support 
25 | D1Sup | RW | | | D1 Support 
24:22 | AuxCur | RW | | | Auxiliary Current 
21 | SpecInit | RW | 0 | | Device Specific Initialization 
18:16 | CapVer | RW | 2 | | Capability Version (as mandated by PCI-SIG) 
15:8 | NxtCapPtr | RW | 0x50 | | Offset to next PCI capability structure 
7:0 | CapId | R | 0x01 | | ID indicating PCI Power Management Capability Structure. 


13.12.18 PCI Power Management Control Register 


Description 


Register 
R_PciePMCtrl 


Attributes 


-writeonemixed 


Address 
0xE_9800_0044 


Definitions 


 | | R | 0 | | Reserved 
 | PMESt | | 0 | | Root Complex will not set this bit. 
 | PMEEn | | | | Since Root Complex never sends PME Message, this bit can be hardwired to 0. 
 | | R | 0 | | Reserved 


13.12.19 MSI Capabilities Register 


Description 


Register 
R_PcieMSICap 


Address 
0xE_9800_0050 


Definitions 


Bit | Mnemonic | Access | Reset | Type | Definition 
31:24 | | | | | Reserved 
23 | MSI64En | RWS | 1 | | 64-bit Address Capable. 
22:20 | MultiMSIEn | RW | 0 | | Multiple Message Enabled 
19:17 | MultiMSICap | RW | 0 | | Multiple Message Capable (writable through DBI) 
16 | MsiEn | RW | 0 | | MSI Enabled (when set, INTx must be disabled) 
15:8 | NxtCapPtr | RW | 0x70 | | Offset to next PCI capability structure 
7:0 | CapId | R | 0x05 | | ID indicating MSI Capability. 


13.12.20 MSI Address Register 


Description 


Contains the MSI Lower 32-bit address (only upper 30 of these 32 bits are writable). 


Register 
R_PcieMSIAddr 


Address 
0xE_9800_0054 


Definitions 


31:2 | MSIAddrL | RW | 0 | | MSI Lower 32-bit Address 
1:0 | | R | 0 | | Reserved 


13.12.21 MSI Upper Address/Data Register 


Description 


Bits 31:0 in this register contain the MSI Upper Address Register, if MSI64En = 1. Otherwise, it contains the 
MSI Data Register. 


Register 
R_PcieMSIUpper 


Address 
0xE_9800_0058 


Definitions 


31:0 | MSIAddrH | RW | | | Upper 32-bit Address (or MSI Data register if MSI64En=0) 


13.12.22 MSI Data Register 


Description 


Contains the MSI Data register if MSI64En = 1. 


Register 
R_PcieMSIData 


Address 
0xE_9800_005C 


Definitions 


31:16 | | R | 0 | | Reserved 
15:0 | MSIData | RW | 0 | | MSI Data (if MSI64En=1) 
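
The way MSI64En changes the meaning of the registers at offsets 0x58 and 0x5C can be sketched as follows. The function and parameter names are hypothetical; the inputs are the raw 32-bit values read from R_PcieMSIAddr, R_PcieMSIUpper, and R_PcieMSIData. Treating the data value as the low 16 bits in the 32-bit case is an assumption based on the 16-bit MSIData field.

```python
def msi_target(msi64en: bool, addr_lo: int, upper_or_data: int, data_reg: int) -> tuple:
    """Return (address, data) of the MSI write described by the three registers."""
    # Only the upper 30 bits of the lower address register are writable.
    low = addr_lo & 0xFFFFFFFC
    if msi64en:
        # 0x58 holds the upper address, 0x5C holds the data.
        address = ((upper_or_data & 0xFFFFFFFF) << 32) | low
        return address, data_reg & 0xFFFF
    # With 32-bit addressing, 0x58 holds the data (assumed to be the low 16 bits).
    return low, upper_or_data & 0xFFFF
```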


13.12.23 PCI Express Capabilities Register 0 


Description 


Register 
R_PcieCap0 


Address 
0xE_9800_0070 


Definitions 


31:30 | | R | | | Reserved 
29:25 | IntMsgNum | RW | 0 | | Interrupt Message Number 
24 | | | | | 1 = Link connected to a Slot. Hardware initialized to a 1. 
23:20 | PortType | R | | | Device Port Type: Root Port of PCIE Root Complex 
19:16 | CapVer | R | 1 | | Capability Version (as mandated by PCI-SIG) 
15:8 | NxtCapPtr | RW | | | Offset to next PCI capability structure 
7:0 | CapId | R | 0x10 | | ID indicating PCI Express Capability Structure. 


13.12.24 PCI Express Capabilities Register 1 


Description 


Register 
R_PcieCap1 


Address 
0xE_9800_0074 


Definitions 


31:16 | | R | | | Reserved 
15 | RBErrRep | RW | 1 | | Role Based Error Reporting 
11:9 | L1Lat | RW | 1 | | Endpoint Acceptable L1 latency 
8:6 | L0sLat | RW | 1 | | Endpoint Acceptable L0s latency 
5 | ExtTag | RW | 0 | | Only 5-bit Tag field supported. 
4:3 | PhanFunc | RW | 0 | | No Phantom Functions supported 
2:0 | MaxPaySiz | RW | 0x2 | | Max Payload Size Supported. 2 = 512 bytes 
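
The MaxPaySiz encoding doubles the payload size with each step, so a code of 2 corresponds to 512 bytes. A minimal sketch of the conversion, assuming the standard PCI Express encoding (codes 0 through 5, with 6 and 7 reserved); the function name is hypothetical.

```python
def max_payload_bytes(code: int) -> int:
    """Convert a 3-bit Max Payload Size code into a size in bytes."""
    if not 0 <= code <= 5:
        raise ValueError("codes 6 and 7 are reserved")
    return 128 << code


supported = max_payload_bytes(0x2)  # reset value of MaxPaySiz in this register
```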
13.12.25 Device Control/Status Register 
Description 
Register 

R_PcieDevCtlStat 


Attributes 


-writeonemixed 


Address 
0xE_9800_0078 


Definitions 


 | | R | 0 | | Reserved 
21 | | | | | 1 = Outbound non-posted transactions pending (ie. have not completed or have not been terminated by the Completion Timeout mechanism) 
20 | AuxPwrDet | R | 0 | | 0 = No Aux Power Detected 
19 | URDet | RW1C | | | 1 = Unsupported Request Detected. Independent of any control or mask setting. 
18 | FatErrDet | RW1C | | | 1 = Fatal Error Detected. Independent of any control or mask setting. 
17 | NFErrDet | RW1C | | | 1 = Non-Fatal Error Detected. Independent of any control or mask setting. 
16 | CorErrDet | RW1C | | | 1 = Correctable Error Detected. Independent of any control or mask setting. 
 | | R | 0 | | Reserved 
14:12 | MaxRdReq | | | | Maximum permissible read request size. 
11 | NoSnpEn | RW | 0 | | Always 0. We do not enable "No Snoop". 
10 | AuxPowEn | RW | 0 | | Enable Aux Power 
7:5 | MaxPaySiz | RW | | | 0 = 128 bytes, 1 = 256 bytes, 2 = 512 bytes. 
4 | | RW | | | We do not enable Relaxed Ordering. 
3 | UrRepEn | RW | | | 1 = enables reporting of Unsupported Request (ie. inbound packet encounters a UR which needs to be reported to the host) 
2 | | | | | 1 = Enables reporting of fatal errors (equivalent of enabling ERR_FATAL messages for a Root Port) 
1 | | | | | 1 = Enables reporting of non-fatal errors (equivalent of enabling ERR_NONFATAL messages for a Root Port) 
0 | | | | | 1 = Enables reporting of correctable errors to the host (equivalent of enabling ERR_COR messages for a Root Port) 


13.12.26 Link Capabilities Register 


Description 
Register 


R_PcieLnkCap 
Address 
0xE_9800_007C 


Definitions 


31:24 | PortNum | RW | 0 | | Port Number for the PCI Express link 
20 | | R | 1 | | Data Link Layer Active Reporting Capable 
19 | | R | 0 | | Surprise Down Error Reporting Capable 
18 | ClkPmCap | RW | 0 | | Clock Power Management 
17:15 | L1ExLat | | | | L1 Exit Latency. Irrelevant for us since we do not support L1. 
14:12 | L0sExLat | RW | | | L0s Exit Latency 
11:10 | | RW | | | Active Link PM Support 
9:4 | MaxLnkWth | RW | 0x8 | | Max Link Width. 8 lanes in our case. 
3:0 | MaxLnkSpd | RW | 0x1 | | 1 = 2.5Gb/s Link. All other encodings are reserved. 


13.12.27 Link Control/Status Register 


Description 
Register 
R_PcieLnkCtl 


Address 
0xE_9800_0080 


Definitions 


28 | SltClkCfg | RW | | | 1 = Component uses same reference clock as on the connector. Initialized by hardware to correct value. 
27 | LnkInTrn | | | | 1 = Link Training in progress. Should be set to 0 by hardware after successful training to the L0 state. 
26 | TrainErr | R | | | 1 = Link Training Error occurred. Should be set to 0 by hardware after successful training to the L0 state. 
25:20 | NegLnkWth | | | | Negotiated link width. We should see values of 1, 2, 4, or 8. 
19:16 | LnkSpd | R | 1 | | 1 = 2.5Gb/s. All other encodings are reserved. 
7 | ExtSync | RW | | | 1 = Forces extra FTS ordered sets when transitioning from low power states to L0. 
6 | ComClkCfg | RW | | | 1 = Common Reference Clock at both sides of the Link. 0 = Asynchronous clocks at both sides of the link. 
5 | LnkRetrain | R | | | 1 = Initiate Link retraining via the Recovery State. Reads return 0. 
4 | | RW | | | 0 = disable the link 
3 | | RW | 0 | | Read Completion Boundary. 0 = 64 bytes 
1:0 | | RW | 0 | | 1 = L0s Entry Enabled. Can be disabled by writing 0x0. 
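
A short sketch of how diagnostic software might check the link after training, using the TrainErr (bit 26), NegLnkWth (bits 25:20), and link speed (bits 19:16) fields of this register. The function name and returned keys are hypothetical; the width and speed checks restate the expectations given in the field descriptions.

```python
def decode_link_status(reg: int) -> dict:
    """Extract the training error flag, negotiated width, and speed code."""
    width = (reg >> 20) & 0x3F
    speed_code = (reg >> 16) & 0xF
    return {
        "train_err": bool((reg >> 26) & 0x1),
        "width": width,
        "width_ok": width in (1, 2, 4, 8),   # values we should see
        "speed_code": speed_code,
        "speed_ok": speed_code == 1,         # 1 = 2.5Gb/s, others reserved
    }


status = decode_link_status((8 << 20) | (1 << 16))
```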
13.12.28 Slot Capabilities Register 
Description 
Register 

R_PcieSltCap 


Address 
0xE_9800_0084 


Definitions 


31:19 | PhySltNum | RW | 0x0 | | Physical Slot Number. I believe this should be 0 for a Root Port. 
18 | SltNoCCSup | RW | 0 | | Slot No Command Complete Support 
17 | SltEmIPrsnt | RW | 0 | | Slot Electromechanical Interlock Present 
16:15 | SltPwrScl | RW | | | Slot Power Limit Scale. Writes to this register cause Port to send Set_Slot_Power_Limit Message. 
14:7 | SltPwrLmt | RW | 0xf | | Slot Power Limit Value. Writes to this register cause Port to send Set_Slot_Power_Limit Message. 
6:0 | SlotCap | RW | 0x7a | | These 7 bits in the Slot Capabilities register are all hardware initialized to some value. 


13.12.29 Slot Control/Status Register 


Description 
Register 
R_PcieSltCtl 


Attributes 


-kernel -writeonemixed 


Address 
0xE_9800_0088 


Definitions 


 | | | | | Reserved 
 | | | | | 1 = Indicates presence of card in slot. 0 = Slot Empty. 
 | | | | | MRL Sensor State. 0 = MRL Closed. 1 = MRL Open. 
 | | | | | 1 = MRL Sensor State Change is detected. 
 | | | | | 1 = Power Controller detects power fault in this slot. 
 | | | | | 1 = Attention Button is Pressed. 
 | | | | | Reserved and Preserved. 
 | | | | | 1 = Power applied to the slot is Off. 0 = Power applied to the slot is On. 
9:8 | PwrIndCtl | RW | 0x3 | | Non-zero writes to this register set these bits as well as send the appropriate PWR_INDICATOR_{ON,OFF,BLINK} Message. 
7:6 | AttIndCtl | RW | | | Non-zero writes to this register set these bits as well as send the appropriate ATTN_INDICATOR_{ON,OFF,BLINK} Message. 
5 | | | | | 1 = Enable Hot-Plug interrupt generation for enabled Hot-Plug events. 
4 | CmdCplEn | RW | | | 1 = Enable Hot-Plug interrupt generation for Command Completed event. 
3 | PrnDetEn | RW | | | 1 = Enable Hot-Plug interrupt generation for presence detect changed event. 
2 | MRLSenEn | RW | | | 1 = Enable Hot-Plug interrupt generation for MRL Sensor Changed event. 
1 | PwrFltEn | RW | | | 1 = Enable Hot-Plug interrupt generation for power fault event. 
0 | AttButEn | RW | | | 1 = Enable Hot-Plug interrupt generation for Attention Button Pressed event. 


13.12.30 Root Control Register 


Description 


Register 
R_PcieRootCtl 


Address 
0xE_9800_008C 


Definitions 


 | | R | 0 | | Reserved 
3 | | | | | 1 = Root Port should generate interrupt if PME Status register bit is set indicating receipt of PME Message. If PME Status bit is already set when this bit is enabled, interrupt should be generated. (Errata doc Page 110 says that the Root Port should generate interrupt wire only when Interrupt Disable bit in Command Register is 0 in addition to the above 2 bits being set). 
2 | FatErrEn | | | | 1 = RC should generate system error if Fatal Error reported by the Root Port or by devices in its hierarchy. This should not happen if SerrEn bit in Command register = 0 (based on Errata doc Page 107). 
1 | NFErrEn | RW | | | 1 = RC should generate system error if NonFatal Error reported by the Root Port or by devices in its hierarchy. This should not happen if SerrEn bit in Command register = 0 (based on Errata doc Page 107). 
0 | CorrErrEn | RW | | | 1 = RC should generate system error if Correctable Error reported by the Root Port or by devices in its hierarchy. This should not happen if SerrEn bit in Command register = 0 (based on Errata doc Page 107). 
13.12.31 Root Status Register 


Description 


The Root Status Register is described here. 


Register 
R_PcieRootStatus 


Address 
0xE_9800_0090 


Definitions 


 | | | | | Reserved 
17 | | | | | 1 = Another PME is pending when PME Status bit is set. When PME Status is cleared by software, pending PME will cause PME Status to be set again with the updated Req ID. Process will continue until no more PMEs are pending. 
16 | PMEStat | RW1C | 0 | | 1 = PME was asserted by requester in bits 15:0. 
15:0 | PMEReqId | R | 0 | | Indicates PCI Requester ID of the last PME requester. 


13.12.32 Advanced Error Reporting Enhanced Capability Header Register 


Description 
Register 
R_PcieAdvErrCapHdr 


Address 
0xE_9800_0100 


Definitions 


31:20 | NxtCapOff | | | | Next Capability Offset (relative to address 0 of config space) 
19:16 | CapVer | | | | Capability Version. Assigned by PCI-SIG. 
15:0 | ExtCapId | | | | PCI Express Extended Capability ID. Assigned by PCI-SIG. 


13.12.33 Advanced Error Reporting Uncorrectable Error Status Register 


Description 


Bits in this register are sticky and report the error status of individual error sources. 


Register 
R_PcieUCorrErr 
Address 
0xE_9800_0104 


Definitions 


Bit | Mnemonic | Access | Reset | Type | Definition 
20 | URErrSt | RW1C | 0 | | Unsupported Request Error Status 
19 | ECRCErrSt | RW1C | 0 | | ECRC Error Status 
18 | MfTLPSt | RW1C | 0 | | Malformed TLP Status 
17 | RcvOvfSt | RW1C | 0 | | Receiver Overflow Status 
16 | UnxCplSt | RW1C | 0 | | Unexpected Completion Status 
15 | CplAbrtSt | RW1C | 0 | | Completer Abort 
14 | CplTOSt | RW1C | 0 | | Completion Timeout Status 
13 | FCErrSt | RW1C | 0 | | Flow Control Protocol Error Status 
12 | | RW1C | 0 | | Poisoned TLP Status 
 | | | | | Reserved 
4 | | RW1C | 0 | | Data Link Protocol Error Status 
0 | | RW1C | 0 | | Training Error Status (default undefined for Rev 1.1) 


13.12.34 Uncorrectable Error Mask Register 


Description 


All unreserved bits in these registers are sticky. 


Register 
R_PcieUncErrMsk 


Address 
0xE_9800_0108 


Definitions 


 | | R | 0 | | Reserved 
20 | URErrMsk | RWS | 0 | | Unsupported Request Error Mask 
19 | ECRCErrMsk | RWS | 0 | | ECRC Error Mask 
18 | MfTLPMsk | RWS | 0 | | Malformed TLP Mask 
17 | RcvOvfMsk | RWS | 0 | | Receiver Overflow Mask 
16 | UnxCplMsk | RWS | 0 | | Unexpected Completion Mask 
15 | CplAbrtMsk | RWS | 0 | | Completer Abort Mask 
14 | CplTOMsk | RWS | 0 | | Completion Timeout Mask 
13 | FCErrMsk | RWS | 0 | | Flow Control Protocol Error Mask 
12 | | RWS | 0 | | Poisoned TLP Mask 
 | | R | | | Reserved 
4 | | RWS | 0 | | Data Link Protocol Error Mask 
 | | R | 0 | | Reserved 
0 | | R | 0 | | Training Error Mask 
13.12.35 Uncorrectable Severity Register 
Description 


All unreserved bits in these registers are sticky. 


Register 
R_PcieUncErrSev 


Address 
0xE_9800_010C 


Definitions 


Bit | Mnemonic | Access | Reset | Type | Definition 
20 | URErrSev | RWS | 0 | | Unsupported Request Error Severity 
19 | ECRCErrSev | RWS | 0 | | ECRC Error Severity 
18 | MfTLPSev | RWS | 1 | | Malformed TLP Severity 
17 | RcvOvfSev | RWS | 1 | | Receiver Overflow Severity 
16 | UnxCplSev | RWS | 0 | | Unexpected Completion Severity 
15 | CplAbrtSev | RWS | 0 | | Completer Abort Severity 
14 | CplTOSev | RWS | 0 | | Completion Timeout Severity 
13 | FCErrSev | RWS | 1 | | Flow Control Protocol Error Severity 
12 | | RWS | 0 | | Poisoned TLP Severity 
4 | | RWS | 1 | | Data Link Protocol Error Severity 
 | | | | | Reserved 


13.12.36 Correctable Error Status Register 


Description 


All unreserved bits in these registers are sticky. 


Register 
R_PcieCorErrSt 


Address 
0xE_9800_0110 


Definitions 


Bit | Mnemonic | Access | Reset | Type | Definition 
 | | | 0 | | Reserved 
13 | | RW1C | 0 | | Advisory NonFatal Error Status 
12 | RplTOSt | RW1C | 0 | | Replay Timer Timeout Status 
 | | R | 0 | | Reserved 
8 | | RW1C | 0 | | Replay Num Rollover Status 
7 | BadDLLPSt | RW1C | 0 | | Bad DLLP Status 
6 | BadTLPSt | RW1C | 0 | | Bad TLP Status 
 | | R | 0 | | Reserved 
0 | RcvErrSt | RW1C | 0 | | Receiver Error Status 
13.12.37 Correctable Error Mask Register 
Description 


All unreserved bits in these registers are sticky. 


Register 
R_PcieCorErrMsk 


Attributes 


-writeonemixed 


Address 
0xE_9800_0114 


Definitions 


13 | NFErrMsk | RW1C | 1 | | Advisory Non-Fatal Error Mask 
12 | RplTOMsk | RW | 0 | | Replay Timer Timeout Mask 
 | | R | | | Reserved 
8 | RplRollMsk | RW | 0 | | Replay Num Rollover Mask 
7 | BadDLLPMsk | RW | 0 | | Bad DLLP Mask 
6 | BadTLPMsk | RW | 0 | | Bad TLP Mask 
 | | R | 0 | | Reserved 
0 | RcvErrMsk | RW | 0 | | Receiver Error Mask 


13.12.38 Advanced Error Capabilities Control Register 


Description 


Register 
R_PcieAdvErrCapCtrl 


Address 
0xE_9800_0118 


Definitions 


 | | R | 0 | | Reserved 
8 | ECRCChkEn | RW | 0 | | 1 = ECRC Checking on inbound packets enabled. Sticky. 
7 | ECRCChkCap | R | 1 | | 1 = This device is capable of checking ECRC. 
6 | ECRCGenEn | RW | 0 | | 1 = ECRC Generation Enabled. Sticky. 
5 | ECRCGenCap | R | 1 | | 1 = This device is capable of generating ECRC. 
4:0 | FstErrPtr | R | | | First Error Pointer. Identifies bit position of first error reported in the Uncorrectable Error Status register. Sticky. 
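
Since FstErrPtr is a bit position within the Uncorrectable Error Status register, error handling code can translate it into a readable name. The sketch below uses the bit assignments of section 13.12.33. Bits 14, 13, and 12 are assumed to follow the standard PCI Express AER layout because the bit column for those rows is hard to read in this copy. The function name is hypothetical.

```python
UNCORRECTABLE_ERRORS = {
    20: "Unsupported Request Error",
    19: "ECRC Error",
    18: "Malformed TLP",
    17: "Receiver Overflow",
    16: "Unexpected Completion",
    15: "Completer Abort",
    14: "Completion Timeout",           # assumed bit position
    13: "Flow Control Protocol Error",  # assumed bit position
    12: "Poisoned TLP",                 # assumed bit position
    4: "Data Link Protocol Error",
    0: "Training Error",
}


def first_error_name(adv_err_cap_ctrl: int) -> str:
    """Name the first uncorrectable error recorded in FstErrPtr (bits 4:0)."""
    pointer = adv_err_cap_ctrl & 0x1F
    return UNCORRECTABLE_ERRORS.get(pointer, "reserved bit %d" % pointer)
```

The pointer is only meaningful while the corresponding status bit is still set, so software would normally read it before clearing the status register.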
13.12.39 Advanced Error Capabilities/Header Log Register (1st Dword) 
Description 


The Header Log is 4 Dwords and contains the header of the TLP that contained a detected error. All bits in 
Header Log register(s) are sticky. 


Register 
R_PcieHdrLog1 


Address 
0xE_9800_011C 


Definitions 


31:0 | HdrLog1 | | 0 | | First Header of TLP that contained a detected error. Sticky. 


13.12.40 Header Log Register (2nd Dword) 


Description 
Register 
R_PcieHdrLog2 


Address 
0xE_9800_0120 


Definitions 


31:0 | HdrLog2 | | | | Second Header of TLP that contained a detected error. Sticky. 


13.12.41 Header Log Register (3rd Dword) 


Description 


Register 
R_PcieHdrLog3 


Address 
0xE_9800_0124 


Definitions 


31:0 | HdrLog3 | | | | Third Header of TLP that contained a detected error. Sticky. 
13.12.42 Header Log Register (4th Dword) 


Description 
Register 
R_PcieHdrLog4 


Address 
0xE_9800_0128 


Definitions 


31:0 | HdrLog4 | | | | Fourth Header of TLP that contained a detected error. Sticky. 


13.12.43 Root Error Command Register 


Description 


This register allows the Root Complex to control reporting (ie. generating or disabling interrupts) of incoming 
ERROR messages. 


Register 
R_PcieRootErrCmd 


Address 
0xE_9800_012C 


Definitions 


 | | | | | Reserved 
2 | | | | | 1 = Enable interrupt generation when ERR_FATAL message received. (Errata doc Page 111 says that the Root Port should generate interrupt wire only when Interrupt Disable bit in Command Register is 0 in addition to the above). 
1 | | | | | 1 = Enable interrupt generation when ERR_NONFATAL message received. (Errata doc Page 111 says that the Root Port should generate interrupt wire only when Interrupt Disable bit in Command Register is 0 in addition to the above). 
0 | | | | | 1 = Enable interrupt generation when ERR_COR message received. (Errata doc Page 111 says that the Root Port should generate interrupt wire only when Interrupt Disable bit in Command Register is 0 in addition to the above). 


13.12.44 Root Error Status Register 


Description 


The Root Error Status register reports the status of error messages (where these could be ERROR messages 
received from other devices, or detected by the Root Port itself). Bits 6:0 of this register are sticky. 

Register 
R_PcieRootErrSt 


Address 
OxE_9800_0130 


Definitions 


Bit | Mnemonic | Access | Definition
31:27 | MsgNum | | Since we allocated only 1 MSI interrupt number, this field is irrelevant for us.
26:7 | | | Reserved.
6 | | | 1 = One or more fatal uncorrectable errors detected. This should not happen if SerrEn bit in Command register = 0 (based on Errata doc Page 107).
5 | NFErrMsgRcv | RW1C | 1 = One or more non-fatal uncorrectable errors detected. This should not happen if SerrEn bit in Command register = 0 (based on Errata doc Page 107).
4 | | RW1C | 1 = Bit 2 was set due to a FATAL error.
3 | NFErrMul | | 1 = Uncorrectable error detected while bit 2 was already set.
2 | NFErrRcv | | 1 = Uncorrectable error detected (by Root Port or via ERR_(NON)FATAL message). This should not happen if SerrEn bit in Command register = 0 (based on Errata doc Page 107).
1 | | | 1 = Correctable error detected while bit 0 was already set.
0 | | | 1 = Correctable error detected (by Root Port or via ERR_COR message). This should not happen if SerrEn bit in Command register = 0 (based on Errata doc Page 107).
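
The status bits above can be decoded in software roughly as follows. This is a minimal sketch, not the driver's actual code: the macro and function names are hypothetical, and it assumes that all of the sticky bits 6:0 are write-1-to-clear (the table only shows RW1C for some of them).

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as referenced in the definitions above (bits 6:0 are sticky). */
#define ROOT_ERR_ST_COR          (1u << 0) /* correctable error detected */
#define ROOT_ERR_ST_COR_MULT     (1u << 1) /* another one while bit 0 was set */
#define ROOT_ERR_ST_UNC          (1u << 2) /* uncorrectable error detected */
#define ROOT_ERR_ST_UNC_MULT     (1u << 3) /* another one while bit 2 was set */
#define ROOT_ERR_ST_FIRST_FATAL  (1u << 4) /* bit 2 was set due to a FATAL error */
#define ROOT_ERR_ST_STICKY_MASK  0x7Fu     /* bits 6:0 */

/* Hypothetical helper: value to write back so that only the bits seen in
 * `status` are cleared (assumes every sticky bit is RW1C). */
static inline uint32_t root_err_st_ack_value(uint32_t status)
{
    return status & ROOT_ERR_ST_STICKY_MASK;
}

/* Hypothetical helper: true when the first uncorrectable error was FATAL. */
static inline bool root_err_st_first_unc_was_fatal(uint32_t status)
{
    return (status & ROOT_ERR_ST_UNC) != 0 &&
           (status & ROOT_ERR_ST_FIRST_FATAL) != 0;
}
```

Writing back only the bits that were read avoids clearing an error that arrives between the read and the write.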


13.12.45 Root Error Source Identification Register 


Description 


The Error Source Identification register (all of whose bits are sticky) keeps track of the requester ID of the first 
such ERROR message for a given category (correctable or uncorrectable). 


Register
R_PcieRootErrSrcId


Address
0xE_9800_0134


Definitions 


31:16 | UncErrId | Contains ReqID of the uncorrectable error (message) detected when bit 2 of the Root Error Status register is being set
15:0 | CorErrId | Contains ReqID of the correctable error (message) detected when bit 0 of the Root Error Status register is being set




13.13 PMI Control and Status Registers


13.13.1 Core Control Register 
Register 


R_PmiCoreCtrl 


Attributes 


-noregtest -kernel 


Address 


0xE_9800_1000 


Definitions


Mnemonic | Definition
CORERSTN |
APPREQL1EXIT | Application requests L1 exit
APPREQL1ENTRY | Application requests L1 entry
 | Rx Lane Flip Enable
 | Tx Lane Flip Enable
 | Write 1 to start the PCI link training, typically after reset.
 | Application requests Hot Reset on PCIE link


13.13.2 PMI Interrupt Summary Register 


Description 


This register is a summary of the various sources of interrupts. For bits that are not labelled RW1C, the source
of the interrupt must be cleared to clear the bit. The state of the bits in this register is independent of the
R_PmiIntrEn register.


Register 


R_PmiIntr


Attributes 


-kernel 


Address 


0xE_9800_1008


Bit | Mnemonic | Product | Definition
63:34 | | ICE9A | Reserved (Overlaps allowed)
63:40 | | ICE9B+ | Reserved (Overlaps allowed)
39 | REQCOMPMULT | ICE9B+ | REQ received multiple errored completions
38 | CSIECCMULT | ICE9B+ | CSI detected multiple ECC errors
37 | REQECCMULT | ICE9B+ | REQ detected multiple ECC errors
36 | CCWSYCECCMULT | ICE9B+ | Data from the SYC to the CCW had multiple ECC errors
35 | CCWCSWECCMULT | ICE9B+ | Data from the CSW to the CCW had multiple ECC errors
34 | SYCECCMULT | ICE9B+ | Data from the CSW to the SYC had multiple ECC errors
33 | REQRST | |
31 | DATLINKDWN | | PCI Data Link Layer Down Indication
30 | PMTLPBLK | | PM requests blocking of outbound non-completion TLPs
 | INTD | | INTD Active
 | INTC | | INTC Active
 | INTB | | INTB Active
 | INTA | | INTA Active
 | CORERR | |
 | NFERR | |
 | PME | |
21 | TOACK | |
20 | VEN | |
19 | AERINT | |
18 | AERMSI | |
17 | PMEINT | |
16 | PMEMSI | |
15 | HPPME | |
14 | HPINT | |
13 | HPMSI | |
11 | CSIADRINT | |
10 | CSIECCINT | |
8 | CSIWTOINT | |
7 | CSIDBIINT | |
5 | REQECCINT | | REQ detected an ECC error
4 | REQCOMPINT | | REQ received an errored completion
3 | CCWSYCECCINT | | Data from the SYC to the CCW had an ECC error
2 | CCWCSWECCINT | | Data from the CSW to the CCW had an ECC error
1 | SYCECCINT | | Data from the CSW to the SYC had an ECC error
0 | | | Reserved
Register


R_PmiIntrEn


Attributes


-kernel


Address


0xE_9800_1010
Bit | Mnemonic | Access | Product | Definition
63:34 | | RW | ICE9A | Reserved (Overlaps allowed)
63:40 | | | ICE9B+ | Reserved (Overlaps allowed)
39 | REQCOMPMULT | | ICE9B+ | REQ received multiple errored completions
38 | CSIECCMULT | | ICE9B+ | CSI detected multiple ECC errors
37 | REQECCMULT | | ICE9B+ | REQ detected multiple ECC errors
36 | CCWSYCECCMULT | | ICE9B+ | Data from the SYC to the CCW had multiple ECC errors
35 | CCWCSWECCMULT | | ICE9B+ | Data from the CSW to the CCW had multiple ECC errors
34 | SYCECCMULT | | ICE9B+ | Data from the CSW to the SYC had multiple ECC errors
33 | REQRST | | | RC requests reset due to link down status
32 | Unused32 | | |
31 | DATLINKDWN | | | PCI Data Link Layer Down Indication
30 | PMTLPBLK | | | PM requests blocking of outbound non-completion TLPs.
 | INTD | | |
 | INTC | | |
 | INTB | | |
 | INTA | | |
 | CORERR | | |
 | NFERR | | |
 | PME | | |
21 | TOACK | | |
20 | VEN | | | Received Vendor Message
19 | AERINT | | | AER INT
18 | AERMSI | | |
17 | PMEINT | | |
16 | PMEMSI | | |
15 | HPPME | | |
14 | HPINT | | |
13 | HPMSI | | |
12 | Unused12 | | |
11 | CSIADRINT | | |
10 | CSIECCINT | | |
9 | Unused9 | | |
8 | CSIWTOINT | | |
7 | CSIDBIINT | | |
6 | Unused6 | | |
5 | REQECCINT | | |
4 | REQCOMPINT | | |
3 | CCWSYCECCINT | | | Data from the SYC to the CCW had an ECC error
2 | CCWCSWECCINT | | | Data from the CSW to the CCW had an ECC error
1 | SYCECCINT | | | Data from the CSW to the SYC had an ECC error
0 | Unused0 | | | unused_0
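
Because the summary register reflects interrupt sources regardless of the enable register, software that wants only the interrupts it has enabled has to combine the two values itself. The following is a minimal sketch under that reading; the function names are hypothetical, and the two arguments stand for values already read from the summary and enable registers.

```c
#include <stdint.h>

/* Interrupt sources that are both asserted in the summary value and
 * enabled in the enable value. */
static inline uint64_t pmi_intr_pending(uint64_t intr, uint64_t intr_en)
{
    return intr & intr_en;
}

/* Index of the lowest set bit, or -1 when no bit is set. */
static inline int pmi_intr_lowest(uint64_t pending)
{
    for (int bit = 0; bit < 64; bit++) {
        if (pending & (UINT64_C(1) << bit))
            return bit;
    }
    return -1;
}
```

A handler could call pmi_intr_lowest() repeatedly, servicing and masking one source at a time, until it returns -1.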
13.13.4 LED Blink Rate Register
Register
R_PmiLedBlinkRate


Attributes


-kernel


Address
0xE_9800_1018


Blink count, in ICLK cycles, for the PWR and ATTN indicators. This count defines the high (and the
low) time of the 50% duty cycle blink rate.
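
Since the count sets both the high time and the low time, one full blink period lasts twice the count in ICLK cycles. A minimal sketch of the arithmetic follows; the function name is hypothetical, the ICLK frequency is a caller-supplied assumption, and the width of the count field is not checked because it is not legible in the table.

```c
#include <stdint.h>

/* Count for a desired blink frequency: the LED is high for `count` ICLK
 * cycles and low for `count` ICLK cycles, so period = 2 * count cycles.
 * Returns 0 when blink_hz is 0 (no meaningful count). */
static inline uint64_t led_blink_count(uint64_t iclk_hz, uint64_t blink_hz)
{
    if (blink_hz == 0)
        return 0;
    return iclk_hz / (2u * blink_hz);
}
```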


13.13.5 Send Unlock Message Register 


Register 
R_PmiSndUnlkMsg 


Address
0xE_9800_1028


Definition
63:1 | | | Reserved
0 | | W1C | Write 1 to send an unlock message out. Self-clearing bit.


13.13.6 Send Turnoff Message Register 


Register 
R_PmiSndTrnOffMsg 


Address
0xE_9800_1030


63:1 | | | Reserved
0 | | W1C | Write 1 to send a turn-off message out. Self-clearing bit.


13.13.7 Link Status Register 


Register 
R_PmiLinkStat 


Attributes 


-kernel 
Address


0xE_9800_1038


Bit | Mnemonic | Access | Reset | Definition
14:12 | PMDS | | | Power Management D-State
 | LTSSMCS | | | Link Training and Status State Machine Current State
 | PMCS | R | 0 | Power Management Current State
 | DATLK | R | 0 | PCI Data Link Layer Up/Down Indication
 | REQRST | | | RC requests reset due to link down status.
 | PHYLK | | | PCI Phy Link Up/Down Indication
 | PMTLPBLK | | | Power management control to block scheduling of new TLP requests


13.13.8 Root-Complex Debug Info 


Register 


R_PmiRcDbg 


Address 


0xE_9800_1040


Bit | Mnemonic | Access | Reset | Definition
 | | R | 0 | Reserved
 | | R | 0 | Transmit Scrambler Disabled
 | | R | 0 | Transmit Link in Training
 | | | | Reserved
2 | DETLOOP | R | | PIPE TxDetectRx/Loopback on. PHY is doing a receiver detection or is in loopback mode
1 | TXEIDLE | | | PIPE TxElecIdle on. PHY transmits electrical idle
0 | TXCOMPL | R | | PIPE TxCompliance on. PHY transmits compliance patterns


13.13.9 Force Ecc Error Register 


Description 


Used to artificially cause single bit ECC errors at various generators within the PMI. 


Register 


R_PmiFrcEccErr 


Address 


0xE_9800_2000 
Bit | Mnemonic | Access | Definition
 | EnECCCorr | | Enable correction of ECC errors
3 | SycBadDat1 | RWS | Flip bit 1 of word 0 of data coming out of the SYC write buffer.
2 | SycBadDat0 | RWS | Flip bit 0 of word 0 of data coming out of the SYC write buffer.
1 | CswBadDat1 | RWS | Flip bit 1 of word 0 of data going out on the CSW.
0 | CswBadDat0 | | Flip bit 0 of word 0 of data going out on the CSW.


13.13.10 CSI Ecc Error Register 


Description 


Debug information in the event an ECC error was detected by the CSI. This is for data coming from the CSW 
to the CSI. 


Register 
R_PmiCsiEccErr 


Address 
0xE_9800_2018


Bit | Mnemonic | Access | Reset | Definition
63:60 | | R | 0 | Reserved
59:54 | | R | 0 | Reserved
53 | Dbe | R | 0 | Error was a double bit error
52 | Mult | R | | Multiple Errors received since last serviced. Cleared when the corresponding multi-interrupt bit in the summary register is cleared.
51:46 | Orig | R | 0 | The origin of the errored transaction.
45:36 | Synd | R | 0 | Syndrome of the errored data.
35:3 | Addr | R | 0 | Address of the errored transaction.
2:0 | | R | 0 | Reserved
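
The syndrome and address fields of this register can be pulled apart with shifts and masks. The sketch below is illustrative only: the type and function names are hypothetical, the Synd (45:36) and Addr (35:3) positions are taken from the table, and the Mult position (bit 52) is an assumption based on the sibling error registers in this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of an ECC error register value. */
typedef struct {
    bool     mult; /* assumed bit 52: multiple errors since last serviced */
    uint16_t synd; /* bits 45:36: syndrome of the errored data */
    uint64_t addr; /* bits 35:3 kept in place, so this is a byte address */
} pmi_ecc_err_t;

static inline pmi_ecc_err_t pmi_ecc_err_decode(uint64_t reg)
{
    pmi_ecc_err_t e;
    e.mult = ((reg >> 52) & 1u) != 0;
    e.synd = (uint16_t)((reg >> 36) & 0x3FFu);
    e.addr = reg & UINT64_C(0xFFFFFFFF8); /* mask for bits 35:3 */
    return e;
}
```

Keeping Addr in its original bit position means the low three bits read as zero and the result is the 8-byte-aligned address of the errored transaction.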


13.13.11 CSI Address Error Register 


Description 


Debug information in the event an out of range address was detected by the CSI. This is for commands coming 
from the CSW that are in the gross CSI range, but not in the range of any specific function within the CSI. To 
wit, the following must be true for the address to be valid: 

Addr[35:12] = 0xE980xx where xx <= 0x30

or 

the address is within the IoSCB space 

or 

the address is within the I2C space 

or 

the address is within the Uart space 
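
The first of these conditions can be checked as sketched below. The function name is hypothetical, and the sketch covers only the Addr[35:12] window; the IoSCB, I2C and Uart ranges are not given in this section and are therefore not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when Addr[35:12] == 0xE980xx with xx <= 0x30.
 * Does not cover the IoSCB, I2C or Uart spaces. */
static inline bool csi_addr_in_reg_window(uint64_t addr)
{
    uint32_t page = (uint32_t)((addr >> 12) & 0xFFFFFFu); /* Addr[35:12] */
    return (page >> 8) == 0xE980u && (page & 0xFFu) <= 0x30u;
}
```

For example, the R_PmiCoreCtrl address 0xE_9800_1000 gives Addr[35:12] = 0xE98001 and passes, while 0xE_9803_1000 gives 0xE98031 and fails.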


Register 
R_PmiCsiAdrErr 
Address


0xE_9800_2020


Bit | Mnemonic | Access | Reset | Definition
63:60 | | R | 0 | Reserved
52 | Mult | | | Multiple Errors received since last serviced. Cleared when the corresponding interrupt bit in the summary register is cleared.
 | | R | | Reserved


13.13.12 DBI 64bit Access Error Register 


Description 


Debug information in the event a 64 bit access to the DBI was detected by the CSI. 


Register 


R_PmiCsiDbikErr 


Address 


0xE_9800_2028


Bit | Mnemonic | Access | Reset | Definition
63:60 | | R | 0 | Reserved
 | | R | 0 | Reserved
52 | Mult | | | Multiple Errors received since last serviced. Cleared when the corresponding interrupt bit in the summary register is cleared.
51:44 | | | | The origin of the errored transaction.
43:36 | | | 0 | Reserved
 | | | 0 | Reserved
 | | R | 0 | Reserved


13.13.13 CSI Wishbone Timeout Error Register 


Description 


Debug information in the event a timeout occurred in a Wishbone transaction. 


Register 


R_PmiCsiWtoErr 


Address 


0xE_9800_2030


Bit | Mnemonic | Access | Reset | Definition
63:60 | | R | 0 | Reserved
 | | R | 0 | Reserved
 | Mult | | | Multiple Errors received since last serviced. Cleared when the corresponding interrupt bit in the summary register is cleared.
2:0 | | R | 0 | Reserved


13.13.14 REQ Ecc Error Register 


Description 


Debug information in the event an ECC error was detected by the REQ. This is for data coming from the CSW 
to the REQ. 


Register 


R_PmiReqEccErr 


Address 
0xE_9800_2040


Bit | Mnemonic | Access | Reset | Definition
63:60 | | R | 0 | Reserved
59:54 | | R | 0 | Reserved
53 | Dbe | | |
52 | Mult | | | Multiple Errors received since last serviced. Cleared when the corresponding multi-interrupt bit in the summary register is cleared.
51:46 | Orig | | 0 | The origin of the errored transaction.
45:36 | Synd | R | 0 | Syndrome of the errored data.
35:3 | Addr | R | 0 | Address of the errored transaction.


13.13.15 REQ Completion Error Register 


Description 


Debug information in the event an errored completion was received by the REQ from the Root Complex. 


Register 
R_PmiReqCompErr


Attributes 


-kernel 


Address 
0xE_9800_2048


Bit | Mnemonic | Access | Reset | Definition
63:60 | Reas | R | 0 | Reason code for errored completion.
 | Mult | | | Multiple Errors received since last serviced. Cleared when the corresponding multi-interrupt bit in the summary register is cleared.
2:0 | | R | 0 | Reserved


13.13.16 SYC CSW Ecc Error Register 


Description 


Debug information in the event an ECC error was detected by the SYC. This is for data coming from the CSW 
to the SYC. The address given is the PCI address of the completion or of the current completion segment. 


Register 


R_PmiSycCswEccErr 


Address 


0xE_9800_2050 

Bit | Mnemonic | Access | Reset | Definition
63:60 | | R | 0 | Reserved
 | Mult | | | Multiple Errors received since last serviced. Cleared when the corresponding multi-interrupt bit in the summary register is cleared.
51:44 | | | | Reserved
43:36 | | | | Syndrome of the errored data.
35:3 | Addr | R | 0 | Address of the errored transaction.
2:0 | | R | 0 | Reserved


13.13.17 CCW CSW Ecc Error Register 


Description 


Debug information in the event an ECC error was detected by the CCW. This is for data coming from the CSW 
to the CCW. 


Register 


R_PmiCcwCswEccErr 


Address 


0xE_9800_2060


Bit | Mnemonic | Access | Reset | Definition
63:60 | | R | 0 | Reserved
59:54 | | R | 0 | Reserved
53 | Dbe | | 0 | Error was a double bit error
52 | Mult | R | | Multiple Errors received since last serviced. Cleared when the corresponding multi-interrupt bit in the summary register is cleared.
51:46 | Orig | | | The origin of the errored transaction.
45:36 | Synd | | 0 | Syndrome of the errored data.
35:3 | Addr | R | 0 | Address of the errored transaction.
2:0 | | R | 0 | Reserved


13.13.18 CCW SYC Ecc Error Register 


Description 


Debug information in the event an ECC error was detected by the CCW. This is for data coming from the SYC 
to the CCW. 


Register 


R_PmiCcwSycEccErr


Address
0xE_9800_2068


Bit | Mnemonic | Access | Reset | Definition
63:60 | | R | 0 | Reserved
59:54 | | R | 0 | Reserved
53 | Dbe | R | 0 | Error was a double bit error
 | Mult | | | Multiple Errors received since last serviced. Cleared when the corresponding multi-interrupt bit in the summary register is cleared.
 | Synd | R | 0 | Syndrome of the errored data.
35:3 | Addr | R | 0 | Address of the errored transaction.
 | | R | | Reserved


13.13.19 MSI Address Register 
Register 
R_PmiMsiAddr 


Attributes 


-kernel 


Address 
0xE_9800_3000


63:6 | Addr | RW | 0 | Base address for the MSI range.
13.13.20 Wishbone Timeout Value Register
Register
R_PmiWbToVal


Address


0xE_9800_3008


Bit | Mnemonic | Access | Reset | Definition
 | | R | 0 | Reserved
 | WBTOVAL | | | Timeout value in CCLKs


13.13.21 VSM Request Double Word 1 and 2 Register 
Register 
R_PmiVmReqDW12 


Address 
0xE_9801_0000


Bit | Mnemonic | Access | Reset | Definition
63:56 | Unused1 | RW | 0 | Unused1
55:48 | CODE | RW | 0 | Code field
47:40 | Unused0 | RW | 0 | Unused
39:24 | REQID | RW | 0 | Requestor ID
23 | TD | RW | 0 | Digest present
 | | | 0 | Attribute Field
19:10 | LEN | RW | | Length Field. Valid values are 0 and 1.
 | | RW | 0 | Traffic Class
6:5 | FMT | RW | 0 | Format field
4:0 | TYPE | RW | 0 | Type field


13.13.22 VSM Request Double Word 3 and 4 Register 
Register 
R_PmiVmReqDW34 


Address 
0xE_9801_0008


63:0 | ADDR | RW | 0 | 64 bit address


13.13.23 VMI Request Data Register 
Register 
R_PmiVmReqDat


Attributes 


-writeonemixed 
Address


0xE_9801_0010


Bit | Mnemonic | Access | Reset | Definition
32 | GO | W1C | | Write 1 to initiate a Vendor Message Request. Software needs to set up the Vendor Message Data Registers and the Vendor Message Header Register before setting this flag. This is a self-resetting flag.
31:0 | DAT | RW | 0 | Optional data for the request
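
The ordering requirement above (header registers first, GO last) suggests building the final write as a single value that carries both the GO flag and the optional data. The following is a minimal sketch under that reading; the macro and function names are hypothetical, and the addresses are the ones listed for the three request registers in this section.

```c
#include <stdint.h>

/* Register addresses as listed in this section. */
#define R_PMI_VM_REQ_DW12  UINT64_C(0xE98010000)
#define R_PMI_VM_REQ_DW34  UINT64_C(0xE98010008)
#define R_PMI_VM_REQ_DAT   UINT64_C(0xE98010010)

#define VM_REQ_DAT_GO      (UINT64_C(1) << 32) /* bit 32: GO */

/* Value for the last write of a request: DAT in bits 31:0 plus GO.
 * DW12 and DW34 are expected to have been written beforehand. */
static inline uint64_t vm_req_dat_value(uint32_t dat)
{
    return VM_REQ_DAT_GO | (uint64_t)dat;
}
```

A caller would write R_PMI_VM_REQ_DW12, then R_PMI_VM_REQ_DW34, and finally R_PMI_VM_REQ_DAT with the value returned here.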


13.13.24 Received Vendor Message Double Word 1 and 2 Register 
Register 
R_PmiRcvVenMsgDW12 


Address 
0xE_9801_0018


Bit | Mnemonic | Access | Reset | Definition
63:56 | | R | 0 | Reserved
55:48 | CODE | R | 0 | Code field
47:40 | TAG | R | 0 | Tag field
39:24 | REQID | | 0 |
23 | TD | | | Digest present. PRC has been configured to strip ECRC, hence this bit will likely always be 0.
22 | EP | R | 0 | Poisoned indicator
21:20 | ATTR | R | 0 | Attribute Field
19:10 | LEN | R | 0 | Length Field
9:7 | TC | R | 0 | Traffic Class
6:5 | FMT | R | 0 | Format field
4:0 | TYPE | R | 0 | Type field


13.13.25 Received Vendor Message Double Word 3 and 4 Register 
Register 
R_PmiRcvVenMsgDW34 


Address 

0xE_9801_0020

63:0 | ADDR | R | 0 | 64 bit address
13.13.26 Received Vendor Message Payload Register 
Register 

R_PmiRcvVenMsgPld 


Address 
0xE_9801_0028


Bit | Mnemonic | Access | Reset | Definition
 | | R | 0 | Reserved
 | OVFLW | R | | Received Vendor Message Overflow. Set if a Vendor Message is received before the previous message was serviced
 | PAYLD | R | 0 | Received Vendor Message Payload


13.14 PCI Express Phy Registers 


All of the registers in this section are within the PCI Express Phy. The contents of these registers come from
the PCIe1 90nm PHY Core Data Book.


13.14.1 Less Than Limit Compare Point Register 


Description 


Less Than Limit Compare point 


Register 
R_PciePhyCrClockCrcmpLtLimit


Address
0xE98100008


15:0 | CrcmpLtLimit | RW | 0x0 | Less Than Limit Compare point


13.14.2 Greater Than Limit Compare Point Register 


Description 


Greater Than Limit Compare point 


Register 
R_PciePhyCrClockCrcmpGtLimit


Address
0xE98100010


 | | | 0xFFFF | Greater Than Limit Compare point.


13.14.3 Compare/Scratch Value Mask Register


Description 


Compare/Scratch value mask 


Register 
R_PciePhyCrClockCrcmpMask


Address
0xE98100018


 | CrcmpMask | | 0xFFFF | Compare/Scratch value mask.
13.14.4 Scratch Space Control Register
Description


Scratch space control bits


Register
R_PciePhyCrClockCrcmpCtl


Address


0xE98100020


1 | HoldScratch1 | RW | 0 | Don't update Scratch1 on register reads.
0 | HoldScratch0 | RW | 0 | Don't update Scratch0 on register reads.


13.14.5 Scratch Register Comparisons To Limits Results Register 
Description 


Results of scratch register comparisons to limits 


Register 
R_PciePhyCrClockCrcmpStat 


Address 
0xE98100028 


Bit | Mnemonic | Access | Reset | Definition
 | S1S0Outside | RS | | Logical OR of S1_S0_LOW and S1_S0_HIGH; useful to determine if the difference
 | S0Outside | | | Logical OR of S0_LOW and S0_HIGH; useful to determine if the value is near signed.
3 | S1S0High | RS | X | Masked(Scratch1-Scratch0) is higher than CRCMP_GT_LIMIT.
2 | S1S0Low | RS | X | Masked(Scratch1-Scratch0) is lower than CRCMP_LT_LIMIT.
1 | S0High | RS | X | Masked Scratch0 is higher than CRCMP_GT_LIMIT.
0 | S0Low | RS | X | Masked Scratch0 is lower than CRCMP_LT_LIMIT.

13.14.6 Number Of Samples To Count Register 
Description 


Number of samples to count 


Register 
R_PciePhyCrClockScopeSamples 


Address 
0xE98100030 


0x100 |_| Number of samples to count. 
13.14.7 Scope Counting Results Register
Description
Results of scope counting. A write to this register will start the counting process. The value of FFFF indicates
counting still in progress.
Register


R_PciePhyCrClockScopeCount 


Address 


0xE98100038 


15:0 | ScopeCount | RS | X | Results of scope counting. A write to this register will start the counting process.


13.14.8 Support DAC Values And Controls Register 


Description 


Support DAC values and controls 


Register 


R_PciePhyCrClockDacCtl 


Address 


0xE98100040 


14:12 | DacMode | RW | 0x0 | DAC output mode: 0 - powered down, 1 - unused, 2 - hi-range margining (VP25*418e-6.
11 | OvrdRtuneRx | RW | 0 | Write DAC_VAL[5:0] to the Rx rtune bus.
10 | OvrdRtuneTx | RW | 0 | Write DAC_VAL[5:0] to the Tx rtune bus.
9:0 | DacVal | RW | 0x1FF | Digital value to use for DAC.


13.14.9 Resistor Tuning Controls Register 
Description 


Resistor tuning controls 


Register 


R_PciePhyCrClockRtuneCtl 


Address 


0xE98100048 
Bit | Mnemonic | Access | Reset | Definition
10 | AdcTrig | RW | 0 | Trigger ADC conversion
9 | RtuneTrig | RW | 0 | Trigger manual resistor calibration.
8 | RtuneDis | RW | 0 | Disable automatic resistor recalibration
 | | | | Invert output of comparator (to reverse SAR feedback loop).
 | DacChop | RW | 0 | Polarity of chop control for DAC
 | | RW | 1 | Set xd in rescal circuitry.
4 | SelAtbp | RW | 0 | Select atb_s_p for A/D measurement
3 | | RW | 0 | Value of poweron to force.
2 | FrcPwron | RW | 0 | Override internal poweron
 | | | | Restune SAR mode: 0 - normal restune, 1 - ADC, 2 - Rx Resistor test, 3 - Tx Resistor.
13.14.10 ADC Process Results Register 


Description 


Results of ADC process. A read from this register starts a new A/D conversion.


Register 
R_PciePhyCrClockAdcOut 


Address 
0xE98100050 


10 | Fresh | RS | X | Flag indicates that a new A/D conversion result is present.
9:0 | Value | RS | X | A/D conversion result. Based on RTUNE_CTL.


13.14.11 Current MPLL Phase Selector Value Register 


Description 


Current MPLL phase selector value 


Register 
R_PciePhyCrClockSsPhase 


Address 
0xE98100058 


Attributes 


-noregtest 


12 | ZeroFreq | RWS | | Current MPLL phase selector value. Must be set for PHASE writes to stick.
11:1 | Val | RWS | 0x0 | Current MPLL phase selector value
0 | Dir | RWS | 0x0 | Current MPLL phase selector value


13.14.12 JTAG Chip ID Register (Upper 16 Bits)


Description


Internal Chip ID used by JTAG - upper 16 bits. Not unique between UP3_1.0 parts.
Register
R_PciePhyCrClockChipIdHi


Address
0xE98100060


15:0 | ChipIdHi | | 0x3005 | Internal Chip ID used by JTAG - upper 16 bits. Not unique between UP3_1.0 parts.


13.14.13 JTAG Chip ID Register (Lower 16 Bits)


Description


Internal Chip ID used by JTAG - lower 16 bits. Not unique between UP3_1.0 parts.


Register
R_PciePhyCrClockChipIdLo


Address


0xE98100068


15:0 | ChipIdLo | R | 0x4CD | Internal Chip ID used by JTAG - lower 16 bits. Not unique between UP3_1.0 parts.


13.14.14 Frequency Control Inputs Status Register 
Description 


Status of Frequency control inputs. Reset value depends on inputs.


Register
R_PciePhyCrClockFreqStat


Address
0xE98100070


Bit | Mnemonic | Access | Reset | Definition
15 | Reserved | RS | X |
14:13 | Prescale | RS | X | Prescaler control
12:8 | Ncy | RS | X | Divide by 4 cycle control
7:6 | Ncy5 | RS | X | Divide by 5 control
5:3 | IntCtl | RS | X | Integral charge pump control
2:0 | PropCtl | RS | X | Proportional charge pump control


13.14.15 Various Control Inputs Status Register 
Description 


Status of various control inputs. Reset value depends on inputs.


Register
R_PciePhyCrClockCtlStat
Address


0xE98100078


Bit | Mnemonic | Access | Reset | Definition
15 | Reserved | RS | X |
14 | FastTech | RS | X | Technology is fast
13 | VpIs1p2 | RS | X | Low voltage supply is 1.2 V
12 | VphIs3p3 | RS | X | High voltage supply is 3.3 V
11 | WideXface | RS | X | Wide interface control
10 | RtuneDoTune | RS | X | Manual resistor tune control
9 | Reserved | RS | X |
8:6 | CkoWordCon | RS | X | Cko_word mux control
5:4 | CkoAliveCon | RS | X | Cko_alive mux control
3 | MpllSsEn | RS | X | Spread spectrum enable
2 | MpllPwron | RS | X | Mpll power-on control
1 | | RS | X | Reference clock is of ...
0 | UseRefclkAlt | RS | X | Use alternate refclk.


13.14.16 Level Control Inputs Status Register 
Description 


Status of level control inputs. Reset value depends on inputs.


Register


R_PciePhyCrClockLvlStat


Address


0xE98100080


Bit | Mnemonic | Access | Reset | Definition
15 | Reserved | RS | X |
14:10 | TxLvl | RS | X | Transmit level
9:5 | LosLvl | RS | X | Loss of Signal Detector level
4:0 | AcjtLvl | RS | X | AC JTAG Comparator level


13.14.17 Creg Control I/O Status Register 


Description 


Status of creg control I/O. Reset value depends on inputs.


Register


R_PciePhyCrClockCregStat


Address


0xE98100088


Bit | Mnemonic | Access | Reset | Definition
15:8 | Reserved | RS | X |
7 | OpDone | RS | X | Operation is complete output
6 | PowerGood | RS | X | Power good output.
5 | CrAck | RS | X | Creg request Acknowledgement
4 | Reserved | RS | X |
3 | CrCapAddr | RS | X | Capture Address request
2 | CrCapData | RS | X | Capture Data request.
1 | CrWrite | RS | X | Write request
0 | CrRead | RS | X | Read request.


13.14.18 Frequency Control Inputs Override Register 
Description 


Override of Frequency control inputs 


Register 


R_PciePhyCrClockFreqOvrd 


Address 


0xE98100090


Bit | Mnemonic | Access | Definition
15 | Ovrd | | Enable override of all bits in this register.
14:13 | Prescale | RW | Prescaler control: 00 - no scaling, 01 - double refclk freq
12:8 | Ncy | | Divide by 4 cycle control. MPLL Divider Period=4*(NCY+1)+NCY5. Valid only when NCY5<=NCY.
7:6 | Ncy5 | RW | Divide by 5 control. MPLL Divider Period=4*(NCY+1)+NCY5. Valid only when NCY5<=NCY.
5:3 | IntCtl | | Integral charge pump control. Integral current = (n+1)/8*fullscale.
2:0 | PropCtl | | Proportional charge pump control.
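
The divider relationship quoted in the definitions, Period=4*(NCY+1)+NCY5 with NCY5<=NCY, can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name; it does not range-check the arguments against the register field widths.

```c
/* MPLL divider period from the NCY and NCY5 settings.
 * Returns -1 when the combination is not valid (NCY5 > NCY). */
static inline int mpll_divider_period(unsigned ncy, unsigned ncy5)
{
    if (ncy5 > ncy)
        return -1;
    return (int)(4u * (ncy + 1u) + ncy5);
}
```

For instance, NCY = 4 with NCY5 = 1 gives 4*(4+1)+1 = 21, while NCY = 1 with NCY5 = 2 is rejected.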


13.14.19 Various Control Inputs Override Register 
Description 


Override of various control inputs 


Register 


R_PciePhyCrClockCtlOvrd 


Address 


0xE98100098
Bit | Mnemonic | Access | Reset | Definition
15 | OvrdStatic | RWS | 0 | Override static controls.
14 | FastTech | RW | 0 | Technology is fast
13 | VpIs1p2 | RW | 0 | Low voltage supply is 1.2 V
12 | VphIs3p3 | RW | 0 | High voltage supply is 3.3 V
10 | RtuneDoTune | RW | 0 | Manual resistor tune control
9 | OvrdClk | RWS | 0 | Override clock controls.
8:6 | CkoWordCon | RW | | Cko_word mux control
5:4 | CkoAliveCon | RW | 0x1 | Cko_alive mux control
3 | MpllSsEn | RW | 0 | Spread spectrum enable.
2 | MpllPwron | RW | 1 | Mpll power-on control
1 | | RW | 0 | Reference clock is of ...
0 | UseRefclkAlt | RW | 0 | Use alternate refclk.


13.14.20 Level Control Inputs Override Register 


Description 


Override of level control inputs 


Register 
R_PciePhyCrClockLvlOvrd 


Address 
0xE981000A0


Bit | Mnemonic | Access | Reset | Definition
15 | Ovrd | RWS | 0 | Override all level controls
14:10 | TxLvl | RW | 0x10 | Transmit level
9:5 | LosLvl | RW | 0x10 | Loss of Signal Detector level
4:0 | AcjtLvl | RW | 0x10 | AC JTAG Comparator level


13.14.21 Creg Control I/O Override Register 


Description 


Override of creg control I/O 


Register 
R_PciePhyCrClockCregOvrd 


Address 
0xE981000A8


Bit | Mnemonic | Access | Reset | Definition
8 | OvrdOut | RWS | 0 | Override outputs (bits 7:5)
7 | OpDone | RW | 0 | Operation is complete output
6 | PowerGood | RW | 1 | Power good output.
5 | CrAck | RW | 0 | Creg request Acknowledgement.
4 | OvrdIn | RWS | 0 | Override inputs (bits 3:0).
3 | CrCapAddr | RW | 0 | Capture Address request.
2 | CrCapData | RW | 0 | Capture Data request.
1 | CrWrite | RW | 0 | Write request
0 | CrRead | RW | 0 | Read request.
13.14.22 MPLL Controls Register
Description 


MPLL Controls 


Register 
R_PciePhyCrClockMpllCtl 


Address 


0xE981000B0


Bit | Mnemonic | Access | Reset | Definition
14:10 | DtbSel1 | RW | | Select of wire to drive onto DTB bit 1: 0 - disabled, 1 -
9:5 | DtbSel0 | | | Select of wire to drive onto DTB bit 0: 0 - disabled, 1 - mpll_gear_shift, 2 - mpll.
4 | RefclkDelay | RW | 0 | Delay refclk output of prescaler.
3 | DisParaCreg | RW | 0 | Disable Parallel creg xface.
2 | OvrdClkdrv | RWS | 0 | Override clock driver controls
1 | ClkdrvDig | RW | 0 | Value for digital clock drivers.
0 | ClkdrvAna | RW | 0 | Value for analog clock drivers.


13.14.23 MPLL Test Controls Register 


Description 


MPLL Test Controls 


Register 
R_PciePhyCrClockMpllTst 


Address 
0xE981000B8


Bit | Mnemonic | Access | Reset | Definition
15 | OvrdCtl | RWS | 0 | Override MPLL reset and gearshift controls
 | GearshiftVal | RW | 0 | Value to override for mpll_gearshift.
 | ResetVal | RW | 0 | Value to override for mpll_reset
 | MeasIv | | 0x0 | Measure various mpll controls: bit 12 - enable phase linearity testing of phase i.
1 | MeasGd | RW | | Measure GD. Should be set when various MEAS_IV bits are set for correct measurement.
0 | AtbSense | RW | 0 | Hook up ATB sense lines.


13.14.24 Transmit Control Inputs Status Register (Lane 0) 
Description 


Status of Transmit control inputs. Reset value depends on inputs.


Register


R_PciePhyCrLane0TxStat
Address


0xE98110008

Bit | Mnemonic | Access | Reset | Definition
15 | Reserved | RS | X |
14:13 | TxEdgerate | RS | X | Edgerate control
12:10 | TxAtten | RS | X | Attenuation amount control
9:6 | TxBoost | RS | X | Boost amount control
5 | Reserved | RS | X |
4 | TxClkAlign | RS | X | Command to align clocks
3:1 | TxEn | RS | X | Transmit enable control
0 | TxCkoEn | RS | X | Tx cko clock enable.


13.14.25 Receiver Control Inputs Status Register (Lane 0) 


Description 


Status of Receiver control inputs. Reset value depends on inputs.


Register
R_PciePhyCrLane0RxStat


Address
0xE98110010


Bit | Mnemonic | Access | Reset | Definition
 | Reserved | RS | X |
13:12 | LosCtl | RS | X | LOS filtering mode control
11 | DpllReset | RS | X | DPLL reset control
10:8 | RxDpllMode | RS | X | DPLL mode control
7:5 | RxEqVal | RS | X | Equalization amount control
4 | RxTermEn | RS | X | Receiver termination enable.
3 | RxAlignEn | RS | X | Receiver alignment enable
2 | RxEn | RS | X | Receiver enable control
1 | RxPllPwron | RS | X | PLL power state control
0 | HalfRate | RS | X | Digital half-rate data control


13.14.26 Output Signals Status Register (Lane 0) 


Description 


Status of output signals. Reset value depends on inputs.


Register
R_PciePhyCrLane0OutStat


Address
0xE98110018


Bit | Mnemonic | Access | Definition
 | Reserved | |
 | | | Transmit receiver detection result.
 | | | Transmit operation is complete output.
2 | Los | R |
 | | RS |
13.14.27 Transmitter Control Inputs Override Register (Lane 0)
Description


Override of Transmitter control inputs


Register
R_PciePhyCrLane0TxOvrd


Address


0xE98110020

Bit | Mnemonic | Access | Reset | Definition
15 | Ovrd | RWS | 0 | Enable override of all bits in this register.
14:13 | TxEdgerate | RW | 0x0 | Edgerate control
12:10 | TxAtten | RW | 0x0 | Attenuation amount control
9:6 | TxBoost | RW | 0x0 | Boost amount control
5 | Reserved | RW | 0 | No effect
4 | TxClkAlign | RW | 0 | Command to align clocks
3:1 | TxEn | RW | 0x3 | Transmit enable control
0 | TxCkoEn | RW | 1 | Tx cko clock enable.


13.14.28 Receiver Control Inputs Override Register (Lane 0) 


Description 


Override of Receiver control inputs 


Register 
R_PciePhyCrLane0RxOvrd


Address
0xE98110028


Bit | Mnemonic | Access | Reset | Definition
 | Ovrd | RWS | 0 | Enable override of all bits in this register.
13:12 | LosCtl | RW | | LOS filtering mode control
11 | DpllReset | RW | 0 | DPLL reset control
10:8 | RxDpllMode | RW | | DPLL mode control
7:5 | RxEqVal | RW | 0x0 | Equalization amount control
4 | RxTermEn | RW | 1 | Receiver termination enable.
3 | RxAlignEn | RW | 1 | Receiver alignment enable.
2 | RxEn | RW | | Receiver enable control
1 | RxPllPwron | RW | 1 | PLL power state control.
0 | HalfRate | RW | 0 | Digital half-rate data control


13.14.29 Output Signals Override Register (Lane 0) 
Description 


Override of output signals 


Register 
R_PciePhyCrLane0OutOvrd
Address


0xE98110030


Bit | Mnemonic | Access | Reset | Definition
5 | Ovrd | RWS | 0 | Enable override of all bits in this register.
4 | TxRxpres | RW | 1 | Transmit receiver detection result.
3 | TxDone | RW | 0 | Transmit operation is complete output.
2 | Los | RW | 0 | Loss of signal output.
1 | RxPllState | RW | 0 | Current state of Rx PLL
0 | RxValid | RW | | Receiver valid output


13.14.30 Debug Control Register (Lane 0) 


Description 


Debug control register 


Register 
R_PciePhyCrLane0DbgCtl


Address
0xE98110038


Bit | Mnemonic | Access | Reset | Definition
14:10 | DtbSel1 | RW | | Select of wire to drive onto DTB bit 1: 0 - disabled, 1 -
9:5 | DtbSel0 | RW | | Select of wire to drive onto DTB bit 0: 0 - disabled, 1 -
4 | DisableRxCk | RW | 0 | Disable rx_ck output.
3 | | RW | 0 | Invert receive data (pre-lbert).
2 | | RW | 0 | Invert transmit data (post-lbert)
1 | ZeroRxData | RW | 0 | Override all receive data to zeros
0 | ZeroTxData | RW | 0 | Override all transmit data to zeros


13.14.31 Pattern Generator Controls Register (Lane 0) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrLane0PgCtl 


Address 
0xE98110080 


| Bits | Mnemonic | Access | Reset | Definition |
|  |  | RW | 0x0 | Pattern for modes |
|  | Trigger | RW | 0 | Insert a single error into a lsb |
| 2:0 | Mode | RW | 0x0 | Pattern to generate: 0 - disabled, 1 - lfsr15. |


13.14.32 Pattern Matcher Controls Register (Lane 0) 
Description 


Pattern Matcher controls


Register
R_PciePhyCrLane0PmCtl


Address 
0xE981100C0


| Bits | Mnemonic | Access | Reset | Definition |
|  | Sync | RW |  | Synchronize pattern matcher LFSR with incoming data; must be turned on then off t. |
|  | Mode | RW |  | Pattern to match: 0 - disabled, 1 - lfsr15, 2 - lfsr7, 3 - d[n] = d[n-10], 4 - d[n] = |


13.14.33 Pattern Match Error Counter Register (Lane 0) 


Description 

Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.
Register 

R_PciePhyCrLane0PmErr


Address 
0xE981100C8 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 15 | Ov14 | RWS | X | If active, multiply COUNT by 128 |
| 14:0 | Count | RWS | X | Current error count. If OV14 field is active, then multiply count by 128. |
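The scaling rule for the error counter can be sketched in a few lines. This is a minimal illustration, not driver code: it assumes the raw value has already been read from the register (a read resets the counter, so the value should be read once and kept), and it assumes the OV14 flag sits in bit 15, which is inferred from Count occupying bits 14:0. The function name `decode_pm_err` is hypothetical.

```python
def decode_pm_err(raw: int) -> int:
    """Return the effective error count from a raw 16-bit PmErr value.

    Assumption: bit 15 is the OV14 flag and bits 14:0 are Count.
    """
    ov14 = (raw >> 15) & 0x1   # scaling flag
    count = raw & 0x7FFF       # 15-bit error count
    # When OV14 is active the count is in units of 128 errors.
    return count * 128 if ov14 else count


if __name__ == "__main__":
    print(decode_pm_err(0x0005))  # OV14 clear: count taken as-is
    print(decode_pm_err(0x8005))  # OV14 set: count scaled by 128
```

Because a read clears the register, software that polls this counter would normally accumulate the decoded values itself rather than re-reading.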


13.14.34 Current Phase Selector Value. Register (Lane 0) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrLane0Phase 


Address 
0xE981100D0 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 10:1 | Val | RWS | 0x0 | Current phase selector value |
| 0 |  | RWS | 0 | Current phase selector value |


13.14.35 Current Frequency Integrator Value. Register (Lane 0)
Description 


Current frequency integrator value. 


Register 


R_PciePhyCrLane0Freq 


Address 


0xE981100D8 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 13:1 | Val | RWS | 0x0 | Current frequency integrator value |
| 0 |  | RWS | 0 | Current frequency integrator value |


13.14.36 Scope Control Register (Lane 0) 
Description 


Control bits for per-transceiver scope portion 


Register 


R_PciePhyCrLane0ScopeCtl 


Address 


0xE981100E0


| Bits | Mnemonic | Access | Reset | Definition |
| 14:11 | Base | RW | 0x0 | Which bit to sample when MODE = 1. |
| 10:2 | Delay | RW | 0x0 | Number of symbols to skip between samples. |
| 1:0 | Mode | RW | 0x0 | Mode of counters: 0 = off, 1 = sample every 10 bits (see BASE), 2 = sample every 11. |
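As an illustration of the field layout above, the following sketch packs and unpacks a Scope Control value using only the bit positions given in the table (Base 14:11, Delay 10:2, Mode 1:0). The helper names are hypothetical, and how the value is actually written to the register is outside the scope of this sketch.

```python
def pack_scope_ctl(base: int, delay: int, mode: int) -> int:
    """Pack Base (14:11), Delay (10:2) and Mode (1:0) into one value."""
    if not 0 <= base <= 0xF:
        raise ValueError("Base is a 4-bit field")
    if not 0 <= delay <= 0x1FF:
        raise ValueError("Delay is a 9-bit field")
    if not 0 <= mode <= 0x3:
        raise ValueError("Mode is a 2-bit field")
    return (base << 11) | (delay << 2) | mode


def unpack_scope_ctl(value: int) -> tuple:
    """Split a Scope Control value back into (base, delay, mode)."""
    return ((value >> 11) & 0xF, (value >> 2) & 0x1FF, value & 0x3)


if __name__ == "__main__":
    # Sample every 10 bits (Mode = 1), bit 3 of each symbol, skip 100 symbols.
    print(hex(pack_scope_ctl(base=3, delay=100, mode=1)))
```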


13.14.37 Recovered Domain Receiver Control Register (Lane 0) 


Description 


Control bits for receiver in recovered domain 


Register 


R_PciePhyCrLane0RxCtl


Address 


0xE981100E8


| Bits | Mnemonic | Access | Reset | Definition |
|  | SwitchVal | RW | 0 | Value to override the data/phase mux. |
|  | OvrdSwitch |  |  | Override the value of the data/phase mux. |
| 12:10 | ModeBp | RW |  | Set BP 2:0 to longer timescale (for FTS patterns). BP0 - Start PHUG profile at 4/. |
| 9:8 | FrugValue | RW | 0x0 | Override value for FRUG |
| 7:6 | PhugValue | RW | 0x0 | Override value for PHUG. |
| 5 | OvrdDpllGain | RW | 0 | Override PHUG and FRUG values. |
| 4 | PhdetPol | RW | 0 | Reverse polarity of phase error. |
| 3:2 | PhdetEdge | RW |  | Edges to use for phase detection; top bit is rising edges, bottom is falling. |
| 1:0 | PhdetEn |  |  | Enable phase detector; top bit is odd slicers, bottom is even. |


13.14.38 Receiver Debug Register (Lane 0) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrLane0RxDbg


Address 
0xE981100F0 


| Bits | Mnemonic | Access | Reset | Definition |
|  | DtbSel1 | RW | 0x0 | Select wire to go on DTB bit 1. |
|  | DtbSel0 | RW | 0x0 | Select wire to go on DTB bit 0. |


13.14.39 RX Control Register (Lane 0) 
Description 


RX Control Bits 


Register 
R_PciePhyCrLane0RxAnaCtrl


Address 
0xE98110180 


Attributes 


-noregtest 


Se a 

| 4 | RxLbiEn | RW | 0 | Digital serial (internal) loopback enable bit. |
| 3 |  | RW | 0 | Wafer level (external) loopback enable bit. |

AOE [0 ek 95 enable bie 
a 
POF nate


13.14.40 RX ATB Register (Lane 0)
Description 


RX ATB bits 


Register 
R_PciePhyCrLane0RxAnaAtb


Address 
0xE98110188 


Attributes 


-noregtest 
| Bits | Mnemonic | Access | Reset | Definition |
| 5 | SensemVrefLos | RW | 0 | Connect atb_s_m to vref_los (vref_rx/14). |
| 4 | SensemVcm | RW | 0 | Connect atb_s_m to RX vcm. |
| 3 | SensemRxM | RW | 0 | Connect atb_s_m to rx_m |
| 2 | SensepRxP | RW | 0 | Connect atb_s_p to rx_p. |
| 1 | ForcepRxM | RW | 0 | Connect atb_f_p to rx_m. |
| 0 | ForcepRxP | RW | 0 | Connect atb_f_p to rx_p. |


13.14.41 8 Bit Programming Register (Lane 0) 
Description 


8 bit programming register 


Register 
R_PciePhyCrLane0PllPrg2 


Address 
0xE98110190 


Attributes 


-noregtest


| Bits | Mnemonic | Access | Reset | Definition |
| 7 | AtbSenseSel | RW |  | Control of charge pump current. 1=Enable signals internal to the PLL. |
| 6 | FrcHcpl | RW |  | Allow override of default value of hcpl. 1=allow hcpl_lcl to control high-couplin. |
| 5 | HcplLcl |  |  | 1=force coupling in vco to maximum. |
| 4 | FrcPwron |  |  | Allow override of default value of pll_pwron. 1=allow pwron_lcl to control pll po. |
| 3 | PwronLcl |  |  | 1=power is supplied to the PLL. |
| 2 |  |  |  | Allow override of default value of pll_pwron. 1=allow pwron_lcl to control pll po. |
| 1 |  | RW | 0 | 1=PLL is held in reset |
| 0 | EnableTestPd |  |  | 1=phase linearity of phase interpolator and VCO is being tested. |


13.14.42 10 Bit Programming Register (Lane 0)
Description 


10 bit programming register 


Register 
R_PciePhyCrLane0PllPrg1


Address 
0xE98110198 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 9 | Unused | RW | 1 | Unused |
| 8 | SelRxck | RW | 0 | Use recovered clock as reference to the PLL. |
| 7:5 | PropCntrl | RW | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full. |
| 4:2 | IntCntrl | RW | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De. |
| 1:0 | Unused | RW |  | Unused |
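The (n+1)/8 scaling used by the PropCntrl and IntCntrl fields can be worked through with a short sketch. It only evaluates the formula quoted in the field definitions; the function name is hypothetical, and the result is a fraction of full scale because the document does not give the full-scale current itself.

```python
from fractions import Fraction


def charge_pump_fraction(n: int) -> Fraction:
    """Fraction of full-scale current selected by a 3-bit control value n.

    Implements current = (n + 1) / 8 * full_scale, returning only the
    (n + 1) / 8 factor.
    """
    if not 0 <= n <= 7:
        raise ValueError("control field is 3 bits wide")
    return Fraction(n + 1, 8)


if __name__ == "__main__":
    # Reset values listed in the table: PropCntrl = 0x5, IntCntrl = 0x2.
    print("PropCntrl reset:", charge_pump_fraction(0x5))
    print("IntCntrl reset:", charge_pump_fraction(0x2))
```

With the listed reset values this gives 6/8 of full scale for the proportional path and 3/8 for the integral path.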


13.14.43 10 Bit Programming Register (Lane 0) 


Description 


10 bit programming register 


Register 
R_PciePhyCrLane0PllMeas 


Address 
0xE981101A0 


| Mnemonic | Definition |
|  | Measure copy of bias current in oscillator on atb_force_m. |
| MeasVcntrl | Measure vcntrl on atb_sense_m. If MEAS_VREF is set as well, atb_sense_p,m measu. |
| MeasVref | Measure vref on atb_sense_p; gd on atb_sense_m. If MEAS_VCNTRL is set as well, at. |
|  | Measure vp16 on atb_sense_p; gd on atb_sense_m. |
|  | Measure startup voltage on atb_sense_p; gd on atb_sense_m. |
|  | Measure vco supply voltage on atb_sense_p; gd on atb_sense_m. |
| MeasVpCp | Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m. If MEAS_1V is set as wel. |
|  | Measure 1V supply voltage on atb_sense_m. If MEAS_VP_CP is set as well, atb_sense. |
| MeasCrowbar | Measure crowbar bias voltage on atb_sense_p; gd on atb_sense_m. |


13.14.44 TX ATB Control Register (Set 1) (Lane 0)
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane0TxAnaAtbsel1


Address 


0xE981101A8 


Attributes 


-noregtest 


|  | VbpfSP | RW |  | Vbpf in edge rate control circuit on ATB_S_P. Set ATB_EN to make this useful. |

-e_ | Benswr | 
S| tent? [RW [0 [Tr connected fo ATES? Por terme J 


|  |  |  |  | Txp connected to ATB_S_P. Set ATB_EN to make this useful. |


eee ae Soe ee 
PT [wretsP__ [RW po id ewehOOC—SSCSOSOOCCC“‘;C~*d 
ro [VvessP__[Rw foi Re SSCSC—C—SSSCSCC 


13.14.45 TX ATB Control Register (Set 2) (Lane 0) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane0TxAnaAtbsel2 


Address 


0xE981101B0 


Attributes 
-noregtest


| Bits | Mnemonic | Access | Reset | Definition |
|  | AtbEn | RW |  | Connect internal and external ATB busses. Needed for all ATB measurements. |
ies = We Renes <5". 0 Se et oP | 


o_ [Weiss RW] 


|  |  |  |  | Vcm replica on ATB_S_P. Set ATB_EN to make this useful. |
|  | VbnsSM | RW |  | Vbns in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbpsSP | RW |  | Vbps in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbnfSM | RW |  | Vbnf in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | Enlpbk |  |  | Enable TX external loopback. Make sure internal loopback is not ON. |
| 0 |  | RW | 0 | Enable TX internal loopback |


13.14.46 TX POWER STATE Control Register (Lane 0) 


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrLane0TxAnaControl


Address 
0xE981101B8 


Attributes 


-noregtest


| Bits | Mnemonic | Access | Reset | Definition |
|  | FrcPwrst | RW |  | Force power state; tx_en<1:0> input overridden by EN_LCL. |
|  | EnLcl |  |  | Locally force tx_en<1:0>: 00 - power off, 01 - tx idle (slow), 10 - transmit data 1. |
|  | FrcDo | RW |  | Force Dataovrd locally. When ON, overrides input data_ovrd value. |
|  | DataovrdLcl |  |  | Local dataovrd control value. Set FRC_DO to make this useful. |
|  | FrcBeacon |  |  | Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input an |
| 1 | BcnLcl | RW |  | Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful. |
| 0 | Unused | RW |  | Unused reg |


13.14.47 Transmit Control Inputs Status Register (Lane 1) 
Description 


Status of Transmit control inputs. Reset value depends on inputs.


Register 
R_PciePhyCrLane1TxStat


Address
0xE98110808

| Bits | Mnemonic | Access | Reset | Definition |
| 15 | Reserved1 | RS | X | Always reads as |
| 14:13 | TxEdgerate | RS | X | Edgerate control |
| 12:10 | TxAtten | RS | X | Attenuation amount control |
| 9:6 | TxBoost | RS | X | Boost amount control |
| 5 | Reserved | RS | X | Always reads as. |
| 4 | TxClkAlign | RS | X | Command to align clocks |
| 3:1 | TxEn | RS | X | Transmit enable control |
| 0 | TxCkoEn | RS | X | Tx cko clock enable. |


13.14.48 Receiver Control Inputs Status Register (Lane 1) 


Description 


Status of Receiver control inputs. Reset value depends on inputs.


Register 
R_PciePhyCrLane1RxStat 


Address 
0xE98110810 


| Bits | Mnemonic | Access | Reset | Definition |
| 14 | Reserved | RS | X | Always reads as |
| 13:12 |  | RS | X | LOS filtering mode control |
| 11 | DpllReset | RS | X | DPLL reset control |
| 10:8 | RxDpllMode | RS | X | DPLL mode control |
| 7:5 | RxEqVal | RS | X | Equalization amount control |
| 4 | RxTermEn | RS | X | Receiver termination enable. |
| 3 | RxAlignEn | RS | X | Receiver alignment enable |
| 2 | RxEn | RS | X | Receiver enable control |
| 1 | RxPllPwron | RS | X | PLL power state control |
| 0 | HalfRate | RS | X | Digital half-rate data control |


13.14.49 Output Signals Status Register (Lane 1) 


Description 


Status of output signals. Reset value depends on inputs.


Register 
R_PciePhyCrLane1OutStat


Address 
0xE98110818 


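| Bits | Mnemonic | Access | Reset | Definition |
| 5 | Reserved | RS | X | Always reads as |
| 4 | TxRxpres | RS | X | Transmit receiver detection result. |
| 3 | TxDone | RS | X | Transmit operation is complete output. |
| 2 | Los | RS | X | Loss of signal output. |
| 1 | RxPllState | RS | X | Current state of Rx PLL |
| 0 | RxValid | RS | X | Receiver valid output |


13.14.50 Transmitter Control Inputs Override Register (Lane 1)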
Description 


Override of Transmitter control inputs 


Register 


R_PciePhyCrLane1TxOvrd


Address 


0xE98110820 

| Bits | Mnemonic | Access | Reset | Definition |
| 15 | Ovrd | RWS | 0 | Enable override of all bits in this register |
| 14:13 | TxEdgerate | RW | 0x0 | Edgerate control |
| 12:10 | TxAtten | RW | 0x0 | Attenuation amount control |
| 9:6 | TxBoost | RW | 0x0 | Boost amount control |
| 5 | Reserved | RW | 0 | No effect |
| 4 | TxClkAlign | RW | 0 | Command to align clocks |
| 3:1 | TxEn | RW | 0x3 | Transmit enable control |
| 0 | TxCkoEn | RW | 1 | Tx cko clock enable. |


13.14.51 Receiver Control Inputs Override Register (Lane 1) 


Description 


Override of Receiver control inputs 


Register 
R_PciePhyCrLane1RxOvrd


Address 
0xE98110828 


| Bits | Mnemonic | Access | Reset | Definition |
| 14 | Ovrd | RWS | 0 | Enable override of all bits in this register |
| 13:12 |  | RW |  | LOS filtering mode control |
| 11 | DpllReset | RW | 0 | DPLL reset control |
| 10:8 | RxDpllMode | RW |  | DPLL mode control |
| 7:5 | RxEqVal | RW | 0x0 | Equalization amount control |
| 4 | RxTermEn | RW | 1 | Receiver termination enable. |
| 3 | RxAlignEn | RW | 1 | Receiver alignment enable. |
| 2 | RxEn | RW |  | Receiver enable control |
| 1 | RxPllPwron | RW | 1 | PLL power state control. |
| 0 | HalfRate | RW | 0 | Digital half-rate data control |
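The Receiver Control Inputs Override Register is listed at 0xE98110028 for Lane 0, 0xE98110828 for Lane 1, and 0xE98111028 for Lane 2, which suggests that the per-lane register blocks repeat every 0x800 bytes. The sketch below computes a lane's register address on that assumption. The stride is inferred from the listed addresses rather than stated in the text, and the names `LANE_STRIDE` and `lane_reg_addr` are hypothetical.

```python
# Assumption: per-lane register blocks are spaced 0x800 bytes apart,
# inferred from the Lane 0/1/2 addresses listed in this chapter.
LANE_STRIDE = 0x800

# Lane 0 address of the Receiver Control Inputs Override Register.
LANE0_RX_OVRD = 0xE98110028


def lane_reg_addr(lane0_addr: int, lane: int) -> int:
    """Return the address of a per-lane register for the given lane."""
    if lane < 0:
        raise ValueError("lane must be non-negative")
    return lane0_addr + lane * LANE_STRIDE


if __name__ == "__main__":
    for lane in range(3):
        print(lane, hex(lane_reg_addr(LANE0_RX_OVRD, lane)))
```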


13.14.52 Output Signals Override Register (Lane 1) 
Description 


Override of output signals 


Register 
R_PciePhyCrLane1OutOvrd


Address


0xE98110830 


| Bits | Mnemonic | Access | Reset | Definition |
| 5 | Ovrd | RWS | 0 | Enable override of all bits in this register |
| 4 | TxRxpres | RW | 1 | Transmit receiver detection result. |
| 3 | TxDone | RW | 0 | Transmit operation is complete output. |
| 2 | Los | RW | 0 | Loss of signal output. |
| 1 | RxPllState | RW | 0 | Current state of Rx PLL |
| 0 | RxValid | RW | 1 | Receiver valid output |


13.14.53 Debug Control Register (Lane 1) 


Description 


Debug control register 


Register 
R_PciePhyCrLane1DbgCtl


Address 
0xE98110838


| Bits | Mnemonic | Access | Reset | Definition |
| 14:10 | DtbSel1 | RW |  | Select of wire to go onto DTB bit 1: 0 - disabled, 1 - |
| 9:5 | DtbSel0 | RW |  | Select of wire to go onto DTB bit 0: 0 - disabled, 1 - |
| 4 | DisableRxCk | RW | 0 | Disable rxck output |
| 3 | Invert | RW | 0 | Invert receive data (pre-lbert). |
| 2 | Invert | RW | 0 | Invert transmit data (post-lbert) |
| 1 | ZeroRxData | RW | 0 | Override all receive data to zeros |
| 0 | ZeroTxData | RW | 0 | Override all transmit data to zeros |


13.14.54 Pattern Generator Controls Register (Lane 1) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrLane1PgCtl


Address 
0xE98110880 


| Bits | Mnemonic | Access | Reset | Definition |
|  |  | RW | 0x0 | Pattern for modes |
|  | Trigger | RW | 0 | Insert a single error into a lsb |
| 2:0 | Mode | RW | 0x0 | Pattern to generate: 0 - disabled, 1 - lfsr15. |


13.14.55 Pattern Matcher Controls Register (Lane 1) 
Description 


Pattern Matcher controls


Register
R_PciePhyCrLane1PmCtl


Address 
0xE981108C0


| Bits | Mnemonic | Access | Reset | Definition |
|  | Sync | RW |  | Synchronize pattern matcher LFSR with incoming data; must be turned on then off t. |
|  | Mode | RW |  | Pattern to match: 0 - disabled, 1 - lfsr15, 2 - lfsr7, 3 - d[n] = d[n-10], 4 - d[n] = |


13.14.56 Pattern Match Error Counter Register (Lane 1) 


Description 

Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.
Register 

R_PciePhyCrLane1PmErr


Address 
0xE981108C8 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 15 | Ov14 | RWS | X | If active, multiply COUNT by 128 |
| 14:0 | Count | RWS | X | Current error count. If OV14 field is active, then multiply count by 128. |


13.14.57 Current Phase Selector Value. Register (Lane 1) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrLane1Phase


Address 
0xE981108D0 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 10:1 | Val | RWS | 0x0 | Current phase selector value |
| 0 |  | RWS | 0 | Current phase selector value |


13.14.58 Current Frequency Integrator Value. Register (Lane 1)
Description 


Current frequency integrator value. 


Register 


R_PciePhyCrLane1Freq


Address 


0xE981108D8 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 13:1 | Val | RWS | 0x0 | Current frequency integrator value |
| 0 |  | RWS | 0 | Current frequency integrator value |


13.14.59 Scope Control Register (Lane 1) 
Description 


Control bits for per-transceiver scope portion 


Register 


R_PciePhyCrLane1ScopeCtl


Address 


0xE981108E0


| Bits | Mnemonic | Access | Reset | Definition |
| 14:11 | Base | RW | 0x0 | Which bit to sample when MODE = 1. |
| 10:2 | Delay | RW | 0x0 | Number of symbols to skip between samples. |
| 1:0 | Mode | RW | 0x0 | Mode of counters: 0 = off, 1 = sample every 10 bits (see BASE), 2 = sample every 11. |


13.14.60 Recovered Domain Receiver Control Register (Lane 1) 


Description 


Control bits for receiver in recovered domain 


Register 


R_PciePhyCrLane1RxCtl 


Address 


0xE981108E8


| Bits | Mnemonic | Access | Reset | Definition |
|  | SwitchVal | RW | 0 | Value to override the data/phase mux. |
|  | OvrdSwitch |  |  | Override the value of the data/phase mux. |
| 12:10 | ModeBp | RW |  | Set BP 2:0 to longer timescale (for FTS patterns). BP0 - Start PHUG profile at 4/. |
| 9:8 | FrugValue | RW | 0x0 | Override value for FRUG |
| 7:6 | PhugValue | RW | 0x0 | Override value for PHUG. |
| 5 | OvrdDpllGain | RW | 0 | Override PHUG and FRUG values. |
| 4 | PhdetPol | RW | 0 | Reverse polarity of phase error. |
| 3:2 | PhdetEdge | RW |  | Edges to use for phase detection; top bit is rising edges, bottom is falling. |
| 1:0 | PhdetEn |  |  | Enable phase detector; top bit is odd slicers, bottom is even. |


13.14.61 Receiver Debug Register (Lane 1) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrLane1RxDbg


Address 
0xE981108F0 


| Bits | Mnemonic | Access | Reset | Definition |
|  | DtbSel1 | RW | 0x0 | Select wire to go on DTB bit 1. |
|  | DtbSel0 | RW | 0x0 | Select wire to go on DTB bit 0. |


13.14.62 RX Control Register (Lane 1) 
Description 


RX Control Bits 


Register 


R_PciePhyCrLane1RxAnaCtrl 


Address 
0xE98110980 


Attributes 


-noregtest 


Se a 

| 4 | RxLbiEn | RW | 0 | Digital serial (internal) loopback enable bit. |
| 3 |  | RW | 0 | Wafer level (external) loopback enable bit. |

AOE [0 ek 95 enable bie 
a 
a


13.14.63 RX ATB Register (Lane 1)
Description 


RX ATB bits 


Register 
R_PciePhyCrLane1RxAnaAtb 


Address 
0xE98110988 


Attributes 


-noregtest 
| Bits | Mnemonic | Access | Reset | Definition |
| 5 | SensemVrefLos | RW | 0 | Connect atb_s_m to vref_los (vref_rx/14). |
| 4 | SensemVcm | RW | 0 | Connect atb_s_m to RX vcm. |
| 3 | SensemRxM | RW | 0 | Connect atb_s_m to rx_m |
| 2 | SensepRxP | RW | 0 | Connect atb_s_p to rx_p. |
| 1 | ForcepRxM | RW | 0 | Connect atb_f_p to rx_m. |
| 0 | ForcepRxP | RW | 0 | Connect atb_f_p to rx_p. |


13.14.64 8 Bit Programming Register (Lane 1) 
Description 


8 bit programming register 


Register 
R_PciePhyCrLane1PllPrg2 


Address 
0xE98110990 


Attributes 


-noregtest


| Bits | Mnemonic | Access | Reset | Definition |
| 7 | AtbSenseSel | RW |  | Control of charge pump current. 1=Enable signals internal to the PLL. |
| 6 | FrcHcpl | RW |  | Allow override of default value of hcpl. 1=allow hcpl_lcl to control high-couplin. |
| 5 | HcplLcl |  |  | 1=force coupling in vco to maximum. |
| 4 | FrcPwron |  |  | Allow override of default value of pll_pwron. 1=allow pwron_lcl to control pll po. |
| 3 | PwronLcl |  |  | 1=power is supplied to the PLL. |
| 2 |  |  |  | Allow override of default value of pll_pwron. 1=allow pwron_lcl to control pll po. |
| 1 |  | RW | 0 | 1=PLL is held in reset |
| 0 | EnableTestPd |  |  | 1=phase linearity of phase interpolator and VCO is being tested. |


13.14.65 10 Bit Programming Register (Lane 1)
Description 


10 bit programming register 


Register 
R_PciePhyCrLane1PllPrg1 


Address 
0xE98110998 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 9 | Unused | RW | 1 | Unused |
| 8 | SelRxck | RW | 0 | Use recovered clock as reference to the PLL. |
| 7:5 | PropCntrl | RW | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full. |
| 4:2 | IntCntrl | RW | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De. |
| 1:0 | Unused | RW |  | Unused |


13.14.66 10 Bit Programming Register (Lane 1) 


Description 


10 bit programming register 


Register 
R_PciePhyCrLane1PllMeas


Address 
0xE981109A0 


| Mnemonic | Definition |
|  | Measure copy of bias current in oscillator on atb_force_m. |
| MeasVcntrl | Measure vcntrl on atb_sense_m. If MEAS_VREF is set as well, atb_sense_p,m measu. |
| MeasVref | Measure vref on atb_sense_p; gd on atb_sense_m. If MEAS_VCNTRL is set as well, at. |
|  | Measure vp16 on atb_sense_p; gd on atb_sense_m. |
|  | Measure startup voltage on atb_sense_p; gd on atb_sense_m. |
|  | Measure vco supply voltage on atb_sense_p; gd on atb_sense_m. |
| MeasVpCp | Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m. If MEAS_1V is set as wel. |
|  | Measure 1V supply voltage on atb_sense_m. If MEAS_VP_CP is set as well, atb_sense. |
| MeasCrowbar | Measure crowbar bias voltage on atb_sense_p; gd on atb_sense_m. |


13.14.67 TX ATB Control Register (Set 1) (Lane 1)
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane1TxAnaAtbsel1


Address 


0xE981109A8 


Attributes 


-noregtest 


|  | VbpfSP | RW |  | Vbpf in edge rate control circuit on ATB_S_P. Set ATB_EN to make this useful. |

-e_ | Benswr | 
S| tent? [RW [0 [Tr connected fo ATES? Por terme J 


|  |  |  |  | Txp connected to ATB_S_P. Set ATB_EN to make this useful. |


eee ae Soe ee 
PT [wretsP__ [RW po id ewehOOC—SSCSOSOOCCC“‘;C~*d 
ro [VvessP__[Rw foi Re SSCSC—C—SSSCSCC 


13.14.68 TX ATB Control Register (Set 2) (Lane 1) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane1TxAnaAtbsel2 


Address 


0xE981109B0 


Attributes 
-noregtest


| Bits | Mnemonic | Access | Reset | Definition |
|  | AtbEn | RW |  | Connect internal and external ATB busses. Needed for all ATB measurements. |
ies = We Renes <5". 0 Se et oP | 


o_ [Weiss RW] 


|  |  |  |  | Vcm replica on ATB_S_P. Set ATB_EN to make this useful. |
|  | VbnsSM | RW |  | Vbns in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbpsSP | RW |  | Vbps in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbnfSM | RW |  | Vbnf in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | Enlpbk |  |  | Enable TX external loopback. Make sure internal loopback is not ON. |
| 0 |  | RW | 0 | Enable TX internal loopback |


13.14.69 TX POWER STATE Control Register (Lane 1) 


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrLane1TxAnaControl 


Address 
0xE981109B8 


Attributes 


-noregtest


| Bits | Mnemonic | Access | Reset | Definition |
|  | FrcPwrst | RW |  | Force power state; tx_en<1:0> input overridden by EN_LCL. |
|  | EnLcl |  |  | Locally force tx_en<1:0>: 00 - power off, 01 - tx idle (slow), 10 - transmit data 1. |
|  | FrcDo | RW |  | Force Dataovrd locally. When ON, overrides input data_ovrd value. |
|  | DataovrdLcl |  |  | Local dataovrd control value. Set FRC_DO to make this useful. |
|  | FrcBeacon |  |  | Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input an |
| 1 | BcnLcl | RW |  | Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful. |
| 0 | Unused | RW |  | Unused reg |


13.14.70 Transmit Control Inputs Status Register (Lane 2) 
Description 


Status of Transmit control inputs. Reset value depends on inputs.


Register 
R_PciePhyCrLane2TxStat


Address
0xE98111008

| Bits | Mnemonic | Access | Reset | Definition |
| 15 | Reserved1 | RS | X | Always reads as |
| 14:13 | TxEdgerate | RS | X | Edgerate control |
| 12:10 | TxAtten | RS | X | Attenuation amount control |
| 9:6 | TxBoost | RS | X | Boost amount control |
| 5 | Reserved | RS | X | Always reads as. |
| 4 | TxClkAlign | RS | X | Command to align clocks |
| 3:1 | TxEn | RS | X | Transmit enable control |
| 0 | TxCkoEn | RS | X | Tx cko clock enable. |


13.14.71 Receiver Control Inputs Status Register (Lane 2) 


Description 


Status of Receiver control inputs. Reset value depends on inputs.


Register 
R_PciePhyCrLane2RxStat 


Address 
0xE98111010 


| Bits | Mnemonic | Access | Reset | Definition |
| 14 | Reserved | RS | X | Always reads as |
| 13:12 |  | RS | X | LOS filtering mode control |
| 11 | DpllReset | RS | X | DPLL reset control |
| 10:8 | RxDpllMode | RS | X | DPLL mode control |
| 7:5 | RxEqVal | RS | X | Equalization amount control |
| 4 | RxTermEn | RS | X | Receiver termination enable. |
| 3 | RxAlignEn | RS | X | Receiver alignment enable |
| 2 | RxEn | RS | X | Receiver enable control |
| 1 | RxPllPwron | RS | X | PLL power state control |
| 0 | HalfRate | RS | X | Digital half-rate data control |


13.14.72 Output Signals Status Register (Lane 2) 


Description 


Status of output signals. Reset value depends on inputs.


Register 
R_PciePhyCrLane2OutStat 


Address 
0xE98111018 


| Bits | Mnemonic | Access | Reset | Definition |
| 5 | Reserved | RS | X | Always reads as |
| 4 | TxRxpres | RS | X | Transmit receiver detection result. |
| 3 | TxDone | RS | X | Transmit operation is complete output. |
| 2 | Los | RS | X | Loss of signal output. |
| 1 | RxPllState | RS | X | Current state of Rx PLL |
| 0 | RxValid | RS | X | Receiver valid output |


13.14.73 Transmitter Control Inputs Override Register (Lane 2)
Description 


Override of Transmitter control inputs 


Register 


R_PciePhyCrLane2TxOvrd 


Address 


0xE98111020 

| Bits | Mnemonic | Access | Reset | Definition |
| 15 | Ovrd | RWS | 0 | Enable override of all bits in this register |
| 14:13 | TxEdgerate | RW | 0x0 | Edgerate control |
| 12:10 | TxAtten | RW | 0x0 | Attenuation amount control |
| 9:6 | TxBoost | RW | 0x0 | Boost amount control |
| 5 | Reserved | RW | 0 | No effect |
| 4 | TxClkAlign | RW | 0 | Command to align clocks |
| 3:1 | TxEn | RW | 0x3 | Transmit enable control |
| 0 | TxCkoEn | RW | 1 | Tx cko clock enable. |


13.14.74 Receiver Control Inputs Override Register (Lane 2) 


Description 


Override of Receiver control inputs 


Register 
R_PciePhyCrLane2RxOvrd 


Address 
0xE98111028 


| Bits | Mnemonic | Access | Reset | Definition |
| 14 | Ovrd | RWS | 0 | Enable override of all bits in this register |
| 13:12 |  | RW |  | LOS filtering mode control. |
| 11 | DpllReset | RW | 0 | DPLL reset control |
| 10:8 | RxDpllMode | RW |  | DPLL mode control |
| 7:5 | RxEqVal | RW | 0x0 | Equalization amount control |
| 4 | RxTermEn | RW | 1 | Receiver termination enable. |
| 3 | RxAlignEn | RW | 1 | Receiver alignment enable. |
| 2 | RxEn | RW |  | Receiver enable control |
| 1 | RxPllPwron | RW | 1 | PLL power state control. |
| 0 | HalfRate | RW | 0 | Digital half-rate data control |


13.14.75 Output Signals Override Register (Lane 2) 
Description 


Override of output signals 


Register 
R_PciePhyCrLane2OutOvrd


Address


0xE98111030 


| Bits | Mnemonic | Access | Reset | Definition |
| 5 | Ovrd | RWS | 0 | Enable override of all bits in this register |
| 4 | TxRxpres | RW | 1 | Transmit receiver detection result. |
| 3 | TxDone | RW | 0 | Transmit operation is complete output. |
| 2 | Los | RW | 0 | Loss of signal output. |
| 1 | RxPllState | RW | 0 | Current state of Rx PLL |
| 0 | RxValid | RW | 1 | Receiver valid output |


13.14.76 Debug Control Register (Lane 2) 


Description 


Debug control register 


Register 
R_PciePhyCrLane2DbgCtl 


Address 
0xE98111038


| Bits | Mnemonic | Access | Reset | Definition |
| 14:10 | DtbSel1 | RW |  | Select of wire to go onto DTB bit 1: 0 - disabled, 1 - |
| 9:5 | DtbSel0 | RW |  | Select of wire to go onto DTB bit 0: 0 - disabled, 1 - |
| 4 | DisableRxCk | RW | 0 | Disable rxck output |
| 3 | Invert | RW | 0 | Invert receive data (pre-lbert). |
| 2 | Invert | RW | 0 | Invert transmit data (post-lbert) |
| 1 | ZeroRxData | RW | 0 | Override all receive data to zeros |
| 0 | ZeroTxData | RW | 0 | Override all transmit data to zeros |


13.14.77 Pattern Generator Controls Register (Lane 2) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrLane2PgCtl 


Address 
0xE98111080 


| Bits | Mnemonic | Access | Reset | Definition |
|  |  | RW | 0x0 | Pattern for modes |
|  | Trigger | RW | 0 | Insert a single error into a lsb |
| 2:0 | Mode | RW | 0x0 | Pattern to generate: 0 - disabled, 1 - lfsr15. |


13.14.78 Pattern Matcher Controls Register (Lane 2) 
Description 


Pattern Matcher controls


Register
R_PciePhyCrLane2PmCtl 


Address 
0xE981110C0


| Bits | Mnemonic | Access | Reset | Definition |
|  | Sync | RW |  | Synchronize pattern matcher LFSR with incoming data; must be turned on then off t. |
|  | Mode | RW |  | Pattern to match: 0 - disabled, 1 - lfsr15, 2 - lfsr7, 3 - d[n] = d[n-10], 4 - d[n] = |


13.14.79 Pattern Match Error Counter Register (Lane 2) 


Description 

Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.
Register 

R_PciePhyCrLane2PmErr 


Address 
0xE981110C8 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 15 | Ov14 | RWS | X | If active, multiply COUNT by 128 |
| 14:0 | Count | RWS | X | Current error count. If OV14 field is active, then multiply count by 128. |


13.14.80 Current Phase Selector Value. Register (Lane 2) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrLane2Phase 


Address 
0xE981110D0 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 10:1 | Val | RWS | 0x0 | Current phase selector value |
| 0 |  | RWS | 0 | Current phase selector value |


13.14.81 Current Frequency Integrator Value. Register (Lane 2)
Description 


Current frequency integrator value. 


Register 


R_PciePhyCrLane2Freq 


Address 


0xE981110D8 


Attributes 


-noregtest 


| Bits | Mnemonic | Access | Reset | Definition |
| 13:1 | Val | RWS | 0x0 | Current frequency integrator value |
| 0 |  | RWS | 0 | Current frequency integrator value |


13.14.82 Scope Control Register (Lane 2) 
Description 


Control bits for per-transceiver scope portion 


Register 


R_PciePhyCrLane2ScopeCtl 


Address 


0xE981110E0 


| Bits | Mnemonic | Access | Reset | Definition |
| 14:11 | Base | RW | 0x0 | Which bit to sample when MODE = 1. |
| 10:2 | Delay | RW | 0x0 | Number of symbols to skip between samples. |
| 1:0 | Mode | RW | 0x0 | Mode of counters: 0 = off, 1 = sample every 10 bits (see BASE), 2 = sample every 11. |


13.14.83 Recovered Domain Receiver Control Register (Lane 2) 


Description 


Control bits for receiver in recovered domain 


Register 


R_PciePhyCrLane2RxCtl 


Address 


0xE981110E8


| Bits | Mnemonic | Access | Reset | Definition |
|  | SwitchVal | RW | 0 | Value to override the data/phase mux. |
|  | OvrdSwitch |  |  | Override the value of the data/phase mux. |
| 12:10 | ModeBp | RW |  | Set BP 2:0 to longer timescale (for FTS patterns). BP0 - Start PHUG profile at 4/. |
| 9:8 | FrugValue | RW | 0x0 | Override value for FRUG |
| 7:6 | PhugValue | RW | 0x0 | Override value for PHUG. |
| 5 | OvrdDpllGain | RW | 0 | Override PHUG and FRUG values. |
| 4 | PhdetPol | RW | 0 | Reverse polarity of phase error. |
| 3:2 | PhdetEdge | RW |  | Edges to use for phase detection; top bit is rising edges, bottom is falling. |
| 1:0 | PhdetEn |  |  | Enable phase detector; top bit is odd slicers, bottom is even. |


13.14.84 Receiver Debug Register (Lane 2) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrLane2RxDbg 


Address 
0xE981110F0 


| Bits | Mnemonic | Access | Reset | Definition |
|  | DtbSel1 | RW | 0x0 | Select wire to go on DTB bit 1. |
|  | DtbSel0 | RW | 0x0 | Select wire to go on DTB bit 0. |


13.14.85 RX Control Register (Lane 2) 
Description 


RX Control Bits 


Register 


R_PciePhyCrLane2RxAnaCtrl 


Address 
0xE98111180 


Attributes 


-noregtest 


Se a 

| 4 | RxLbiEn | RW | 0 | Digital serial (internal) loopback enable bit. |
| 3 |  | RW | 0 | Wafer level (external) loopback enable bit. |

AOE [0 ek 95 enable bie 
Po | [Margin enable bit 
POF nate


13.14.86 RX ATB Register (Lane 2)
Description 


RX ATB bits 


Register 
R_PciePhyCrLane2RxAnaAtb 


Address 
0xE98111188 


Attributes 


-noregtest 


| 5 | SensemVrefLos | RW | 0 | Connect atb_sm to vref_los (vrefrx/14). |
| 4 | SensemVcm | RW | 0 | Connect atb_sm to RX vcm. |
| 3 | SensemRxM | RW | 0 | Connect atb_sm to rx_m. |
| 2 | SensepRxP | RW | 0 | Connect atb_sp to rx_p. |
| 1 | ForcepRxM | RW | 0 | Connect atb_fp to rx_m. |
| 0 | ForcepRxP | RW | 0 | Connect atb_fp to rx_p. |


13.14.87 8 Bit Programming Register (Lane 2) 
Description 


8 bit programming register 


Register 
R_PciePhyCrLane2PllPrg2 


Address 
0xE98111190 


Attributes 


-noregtest


[Denton ——s—C—~“*S*S*SC“‘“S*S*S*SCS*S 
AtbSenseSel RW Control of ——— charge pump current 1=Enable 
signals internal to the PLL. 
FrceHcpl RW Allow override of default value of hep] 1=allow hcpl_lcl to 
ae high-couplin. 


aa HeplLcl ee | 1=force coupling in vco to maximum. ssid 


FrcPwron |_| ie override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


| 3 | Pwronkcl | 10 =| ‘| 1=power is supplied to the PLL. 


Se ocr Allow override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


[Reeth RW [0 [___[ T= PEL is held placed in reset 


P| EnableTestPd pee 1=phase linearity of phase interpolator and VCO is being 
tested. 


13.14.88 10 Bit Programming Register (Lane 2)
Description 


10 bit programming register 


Register 
R_PciePhyCrLane2PllPrg1


Address 
0xE98111198 


Attributes 


-noregtest 


| 9 | Unused | RW | 1 | Unused. |
| 8 | SelRxck | RW | 0 | Use recovered clock as reference to the PLL. |
| 7:5 | PropCntrl | RW | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full. |
| 4:2 | IntCntrl | RW | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De. |
| 1:0 | Unused | RW |  | Unused. |
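
The charge pump formula quoted for PropCntrl and IntCntrl, current = (n+1)/8 of full scale, can be checked with a few lines of arithmetic. The sketch below is only a worked example of that formula; the function name is hypothetical, and it assumes n is the raw 3-bit field value.

```python
from fractions import Fraction


def charge_pump_fraction(n):
    """Fraction of full-scale current selected by a 3-bit control code n."""
    if not 0 <= n <= 7:
        raise ValueError("control code must fit in 3 bits")
    return Fraction(n + 1, 8)


if __name__ == "__main__":
    # Reset values listed in the table: PropCntrl = 0x5, IntCntrl = 0x2.
    for name, code in (("PropCntrl", 0x5), ("IntCntrl", 0x2)):
        print(name, charge_pump_fraction(code))
```

Under this reading, the reset values select 6/8 of full scale for the proportional path and 3/8 for the integral path, and no code selects zero current.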


13.14.89 10 Bit Programming Register (Lane 2) 


Description 


10 bit programming register 


Register 
R_PciePhyCrLane2PllMeas 


Address 
0xE981111A0 


|  | MeasBias |  |  | Measure copy of bias current in oscillator on atb_force_m. |
|  | MeasVcntrl |  |  | Measure vcntrl on atb_sense_m. If MEAS_VREF is set as well, atb_sense_p,m measu. |
|  | MeasVref |  |  | Measure vref on atb_sense_p; gd on atb_sense_m. If MEAS_VCNTRL is set as well, at. |
|  |  |  |  | Measure vp16 on atb_sense_p; gd on atb_sense_m. |
|  |  |  |  | Measure startup voltage on atb_sense_p; gd on atb_sense_m. |
|  |  |  |  | Measure vco supply voltage on atb_sense_p; gd on atb_sense_m. |
|  | MeasVpCp |  |  | Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m. If MEAS_1V is set as wel. |
|  |  |  |  | Measure 1V supply voltage on atb_sense_m. If MEAS_VP_CP is set as well, atb_sense. |
|  | MeasCrowbar |  |  | Measure crowbar bias voltage on atb_sense_p; gd on atb_sense_m. |


May 14, 2014 723 Rev 51328 


SiCortex Confidential CHAPTER 13. PCI EXPRESS SUBSYSTEM 


13.14.90 TX ATB Control Register (Set 1) (Lane 2) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane2TxAnaAtbsel1


Address 


0xE981111A8


Attributes 


-noregtest 


|  | VbpfSP | RW |  | Vbpf in edge rate control circuit on ATB_S_P. Set ATB_EN to make this useful. |

-e_ | Benswr | 
S| tent? [RW [0 [Tr connected fo ATES? Por terme J 


|  |  |  |  | Txp connected to ATB_S_P. Set ATB_EN to make this useful. |


eee ae Soe ee 
PT [wretsP__ [RW po id ewehOOC—SSCSOSOOCCC“‘;C~*d 
ro [VvessP__[Rw foi Re SSCSC—C—SSSCSCC 


13.14.91 TX ATB Control Register (Set 2) (Lane 2) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane2TxAnaAtbsel2 


Address 


0xE981111B0 


Attributes 
-noregtest 




[Deinition SSCS 


|  | AtbEn | RW |  | Connect internal and external ATB busses. Needed for all ATB measurements. |
ies = We Renes <5". 0 Se et oP | 


o_ [Weiss RW] 


Peas PR Vi ER ATEN TT Vem replica on ATB_S_P Set ATB_EN to make this use- 
4 


|  | VbnsSM | RW |  | Vbns in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbpsSP | RW |  | Vbps in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbnfSM | RW |  | Vbnf in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | Enlpbk |  |  | Enable TX external loopback. Make sure internal loopback is not ON. |
| 0 |  | RW | 0 | Enable TX internal loopback. |


13.14.92 TX POWER STATE Control Register (Lane 2)


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrLane2TxAnaControl 


Address 
0xE981111B8 


Attributes 


-noregtest


|  | FrcPwrst | RW |  | Force power state: tx_en<1:0> input overridden by EN_LCL. |
|  | EnLcl |  |  | Locally force tx_en<1:0>: 00 - power off; 01 - tx idle (slow); 10 - transmit data; 1. |
|  | FrcDo | RW |  | Force Dataovrd locally. When ON, overrides input data_ovrd value. |
|  | DataovrdLcl |  |  | Local dataovrd control value. Set FRC_DO to make this useful. |
|  | FrcBeacon |  |  | Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input an |
| 1 | BcnLcl | RW |  | Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful. |
| 0 | Unused | RW | 0 | Unused reg. |
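
The per-lane register addresses listed in this section appear to follow a fixed pattern: each lane's block starts 0x800 above the previous one, and a given register sits at the same offset in every block. The sketch below captures that observation for lanes 2 through 5 only. The constant and function names are hypothetical, the offsets are a sample taken from the addresses listed here, and nothing is assumed about lanes outside this range.

```python
# Observed from the listed addresses (lanes 2-5 only); names are hypothetical.
LANE2_BASE = 0xE98111000   # start of the Lane 2 register block
LANE_STRIDE = 0x800        # distance between consecutive lane blocks

# A sample of register offsets within one lane block.
OFFSETS = {
    "TxStat": 0x008,
    "RxStat": 0x010,
    "TxOvrd": 0x020,
    "PmErr": 0x0C8,
    "ScopeCtl": 0x0E0,
    "PllPrg1": 0x198,
    "TxAnaControl": 0x1B8,
}


def lane_reg_addr(lane, reg):
    """Address of register `reg` for `lane`, valid for lanes 2 through 5."""
    if lane not in range(2, 6):
        raise ValueError("only lanes 2-5 are covered by this sketch")
    return LANE2_BASE + (lane - 2) * LANE_STRIDE + OFFSETS[reg]


if __name__ == "__main__":
    for lane in range(2, 6):
        print(lane, hex(lane_reg_addr(lane, "TxStat")))
```

A table-driven helper like this makes it easy to cross-check a transcribed address against its neighbors in the other lanes.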


13.14.93 Transmit Control Inputs Status Register (Lane 3) 
Description 


Status of Transmit control inputs. Reset value depends on inputs.


Register 
R_PciePhyCrLane3TxStat 


Address


0xE98111808

| 15 | Reserved | RS | X | Always reads as. |
| 14:13 | TxEdgerate | RS | X | Edgerate control. |
| 12:10 | TxAtten | RS | X | Attenuation amount control. |
| 9:6 | TxBoost | RS | X | Boost amount control. |
| 5 | Reserved | RS | X | Always reads as. |
|  | TxClkAlign | RS | X | Command to align clocks. |
|  | TxEn | RS | X | Transmit enable control. |
| 0 | TxCkoEn | RS | X | Tx cko clock enable. |


13.14.94 Receiver Control Inputs Status Register (Lane 3) 


Description 


Status of Receiver control inputs. Reset value depends on inputs.


Register 
R_PciePhyCrLane3RxStat 


Address 
0xE98111810 


| 14 | Reserved | RS | X | Always reads as. |
| 13:12 | Los | RS | X | LOS filtering mode control. |
| 11 | DpllReset | RS | X | DPLL reset control. |
| 10:8 | RxDpllMode | RS | X | DPLL mode control. |
| 7:5 | RxEqVal | RS | X | Equalization amount control. |
| 4 | RxTermEn | RS | X | Receiver termination enable. |
| 3 | RxAlignEn | RS | X | Receiver alignment enable. |
| 2 | RxEn | RS | X | Receiver enable control. |
| 1 | RxPllPwron | RS | X | PLL power state control. |
| 0 | HalfRate | RS | X | Digital half-rate data control. |


13.14.95 Output Signals Status Register (Lane 3) 


Description 


Status of output signals. Reset value depends on inputs.


Register 
R_PciePhyCrLane3OutStat 


Address 
0xE98111818 


Always reads a 

Transmit receiver detection result. 

|__| Transmit operation is complete output. 
2 R Loss 31 j 


p2 [tos J 
PoP Rava TRS] 


13.14.96 Transmitter Control Inputs Override Register (Lane 3)
Description 


Override of Transmitter control inputs 


Register 
R_PciePhyCrLane3TxOvrd 


Address 


0xE98111820 

| 15 | Ovrd | RWS | 0 | Enable override of all bits in this register. |
| 14:13 | TxEdgerate | RW | 0x0 | Edgerate control. |
| 12:10 | TxAtten | RW | 0x0 | Attenuation amount control. |
| 9:6 | TxBoost | RW | 0x0 | Boost amount control. |
| 5 | Reserved | RW | 0 | No effect. |
|  | TxClkAlign | RW | 0 | Command to align clocks. |
|  | TxEn | RW | 0x3 | Transmit enable control. |
| 0 | TxCkoEn | RW | 1 | Tx cko clock enable. |
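
To illustrate how the Ovrd bit gates the other fields, the sketch below builds a Transmitter Control Inputs Override value that sets Ovrd and replaces the three drive-strength fields while keeping the low six bits of an existing value untouched. The bit positions used (Ovrd at 15, TxEdgerate at 14:13, TxAtten at 12:10, TxBoost at 9:6) are a reading of the table above and should be treated as assumptions; the function name is hypothetical.

```python
# Assumed layout: Ovrd[15], TxEdgerate[14:13], TxAtten[12:10], TxBoost[9:6].
# Bits 5:0 (Reserved, TxClkAlign, TxEn, TxCkoEn) are carried over unchanged.
TX_OVRD_ENABLE = 1 << 15
LOW_BITS_MASK = 0x3F


def tx_ovrd_word(current, edgerate, atten, boost):
    """Return a new override value with Ovrd set and drive fields replaced."""
    for name, value, width in (
        ("edgerate", edgerate, 2),
        ("atten", atten, 3),
        ("boost", boost, 4),
    ):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    low = current & LOW_BITS_MASK
    return TX_OVRD_ENABLE | (edgerate << 13) | (atten << 10) | (boost << 6) | low


if __name__ == "__main__":
    # Start from a value whose low bits are 0x07 and override the drive fields.
    print(hex(tx_ovrd_word(0x0007, edgerate=1, atten=2, boost=3)))
```

Because Ovrd enables override of all bits in the register, carrying the low bits over from the current value avoids changing the transmit enables as a side effect.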


13.14.97 Receiver Control Inputs Override Register (Lane 3) 


Description 


Override of Receiver control inputs 


Register 
R_PciePhyCrLane3RxOvrd 


Address 
0xE98111828 


| 14 | Ovrd | RWS | 0 | Enable override of all bits in this register. |
| 13:12 | Los | RW |  | LOS filtering mode control. |
| 11 | DpllReset | RW | 0 | DPLL reset control. |
| 10:8 | RxDpllMode | RW |  | DPLL mode control. |
| 7:5 | RxEqVal | RW | 0x0 | Equalization amount control. |
| 4 | RxTermEn | RW | 1 | Receiver termination enable. |
| 3 | RxAlignEn | RW | 1 | Receiver alignment enable. |
| 2 | RxEn | RW |  | Receiver enable control. |
| 1 | RxPllPwron | RW | 1 | PLL power state control. |
| 0 | HalfRate | RW | 0 | Digital half-rate data control. |


13.14.98 Output Signals Override Register (Lane 3) 
Description 


Override of output signals 


Register 
R_PciePhyCrLane3OutOvrd


Address


0xE98111830


| 5 | Ovrd | RWS | 0 | Enable override of all bits in this register. |
| 4 | TxRxpres | RW | 1 | Transmit receiver detection result. |
| 3 | TxDone | RW | 0 | Transmit operation is complete output. |
| 2 | Los | RW | 0 | Loss of signal output. |
| 1 | RxPllState | RW | 0 | Current state of Rx PLL. |
| 0 | RxValid | RW |  | Receiver valid output. |


13.14.99 Debug Control Register (Lane 3) 


Description 


Debug control register 


Register 
R_PciePhyCrLane3DbgCtl 


Address 
0xE98111838


| 14:10 | DtbSel1 | RW |  | Select of wire to go onto DTB bit 1: 0 - disabled; 1 - |
| 9:5 | DtbSel0 | RW |  | Select of wire to go onto DTB bit 0: 0 - disabled; 1 - |
| 4 | DisableRxCk | RW | 0 | Disable rx_ck output. |
| 3 | InvertRx | RW | 0 | Invert receive data (pre-lbert). |
| 2 | InvertTx | RW | 0 | Invert transmit data (post-lbert). |
| 1 | ZeroRxData | RW | 0 | Override all receive data to zeros. |
| 0 | ZeroTxData | RW | 0 | Override all transmit data to zeros. |


13.14.100 Pattern Generator Controls Register (Lane 3) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrLane3PgCtl 


Address 
0xE98111880 


|  | Pat | RW | 0x0 | Pattern for modes |
| 3 | Trigger | RW | 0 | Insert a single error into a lsb. |
| 2:0 | Mode | RW | 0x0 | Pattern to generate: 0 - disabled; 1 - lfsr15. |


13.14.101 Pattern Matcher Controls Register (Lane 3) 
Description 


Pattern Matcher controls 


Register
R_PciePhyCrLane3PmCtl


Address
0xE981118C0


|  | Sync | RW |  | Synchronize pattern matcher LFSR with incoming data; must be turned on then off t. |
|  | Mode | RW |  | Pattern to match: 0 - disabled; 1 - lfsr15; 2 - lfsr7; 3 - d[n] = d[n-10]; 4 - d[n] = |


13.14.102 Pattern Match Error Counter Register (Lane 3) 


Description 

Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.

Register 

R_PciePhyCrLane3PmErr 


Address 
0xE981118C8 


Attributes 


-noregtest 


| 15 | Ov14 | RWS | X | If active, multiply COUNT by 128. |
| 14:0 | Count | RWS | X | Current error count. If OV14 field is active, then multiply count by 128. |
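
The relationship between the overflow flag and Count can be expressed in a few lines. The sketch below decodes a value that has already been read from the Pattern Match Error Counter register; it assumes the overflow flag is bit 15 and Count is bits 14:0, as read from the table above, and the function name is hypothetical. Since a read resets the register, software would need to keep the value it read rather than read again.

```python
def pm_error_count(word):
    """Effective error count for a value already read from the register."""
    count = word & 0x7FFF        # Count, bits 14:0
    scaled = (word >> 15) & 1    # overflow flag, assumed to be bit 15
    return count * 128 if scaled else count


if __name__ == "__main__":
    for raw in (0x0005, 0x8005):
        print(hex(raw), pm_error_count(raw))
```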


13.14.103 Current Phase Selector Value. Register (Lane 3) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrLane3Phase


Address 
0xE981118D0 


Attributes 


-noregtest 


|  | Val | RWS | 0x0 | Current phase selector value. |
| 0 |  | RWS | 0 | Current phase selector value. |


13.14.104 Current Frequency Integrator Value. Register (Lane 3)
Description 


Current frequency integrator value. 


Register 


R_PciePhyCrLane3Freq 


Address 


0xE981118D8 


Attributes 


-noregtest 


|  | Val | RWS | 0x0 | Current frequency integrator value. |
| 0 |  | RWS | 0 | Current frequency integrator value. |


13.14.105 Scope Control Register (Lane 3) 
Description 


Control bits for per-transceiver scope portion 


Register 


R_PciePhyCrLane3ScopeCtl 


Address 


0xE981118E0 


| 14:11 | Base | RW | 0x0 | Which bit to sample when MODE = 1. |
| 10:2 | Delay | RW | 0x0 | Number of symbols to skip between samples. |
| 1:0 | Mode | RW | 0x0 | Mode of counters: 0 = off; 1 = sample every 10 bits (see BASE); 2 = sample every 11. |


13.14.106 Recovered Domain Receiver Control Register (Lane 3) 


Description 


Control bits for receiver in recovered domain 


Register 


R_PciePhyCrLane3RxCtl 


Address 


0xE981118E8 


|  | SwitchVal | RW | 0 | Value to override the data/phase mux. |
|  | OvrdSwitch |  |  | Override the value of the data/phase mux. |
| 12:10 | ModeBp | RW |  | Set BP 2:0 to longer timescale (for FTS patterns). BP0 - Start PHUG profile at 4/. |
| 9:8 | FrugValue | RW | 0x0 | Override value for FRUG. |
| 7:6 | PhugValue | RW | 0x0 | Override value for PHUG. |
| 5 | OvrdDpllGain | RW | 0 | Override PHUG and FRUG values. |
| 4 | PhdetPol | RW | 0 | Reverse polarity of phase error. |
| 3:2 | PhdetEdge | RW |  | Edges to use for phase detection: top bit is rising edges, bottom is falling. |
| 1:0 | PhdetEn |  |  | Enable phase detector: top bit is odd slicers, bottom is even. |


13.14.107 Receiver Debug Register (Lane 3) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrLane3RxDbg 


Address 
0xE981118F0


|  | DtbSel1 | RW | 0x0 | Select wire to go on DTB bit 1. |
|  | DtbSel0 | RW | 0x0 | Select wire to go on DTB bit 0. |


13.14.108 RX Control Register (Lane 3) 
Description 


RX Control Bits 


Register 
R_PciePhyCrLane3RxAnaCtrl 


Address 
0xE98111980 


Attributes 


-noregtest 


Se a 

}4 | RxlbiEfn |RW [0 ~~ [|__| Digital serial (internal) loopback enable bit. 

[3 [Reba | RW—| 0 || Wafer level (external) loopback enable bit. 

AOE [0 ek 95 enable bie 
Po | [Margin enable bit 
a 


13.14.109 RX ATB Register (Lane 3)
Description 


RX ATB bits 


Register 
R_PciePhyCrLane3RxAnaAtb 


Address 
0xE98111988 


Attributes 


-noregtest 


| 5 | SensemVrefLos | RW | 0 | Connect atb_sm to vref_los (vrefrx/14). |
| 4 | SensemVcm | RW | 0 | Connect atb_sm to RX vcm. |
| 3 | SensemRxM | RW | 0 | Connect atb_sm to rx_m. |
| 2 | SensepRxP | RW | 0 | Connect atb_sp to rx_p. |
| 1 | ForcepRxM | RW | 0 | Connect atb_fp to rx_m. |
| 0 | ForcepRxP | RW | 0 | Connect atb_fp to rx_p. |


13.14.110 8 Bit Programming Register (Lane 3) 
Description 


8 bit programming register 


Register 
R_PciePhyCrLane3PllPrg2 


Address 
0xE98111990 


Attributes 


-noregtest


[Denton ——s—C—~“*S*S*SC“‘“S*S*S*SCS*S 
AtbSenseSel RW Control of ——— charge pump current 1=Enable 
signals internal to the PLL. 
FrceHcpl RW Allow override of default value of hep] 1=allow hcpl_lcl to 
ae high-couplin. 


aa HeplLcl ee | 1=force coupling in vco to maximum. ssid 


FrcPwron |_| ie override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


| 3 | Pwronkcl | 10 =| ‘| 1=power is supplied to the PLL. 


Se ocr Allow override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


[Reeth RW [0 [___[ T= PEL is held placed in reset 


P| EnableTestPd pee 1=phase linearity of phase interpolator and VCO is being 
tested. 


13.14.111 10 Bit Programming Register (Lane 3)
Description 


10 bit programming register 


Register 
R_PciePhyCrLane3PllPrg1


Address 
0xE98111998 


Attributes 


-noregtest 


| 9 | Unused | RW | 1 | Unused. |
| 8 | SelRxck | RW | 0 | Use recovered clock as reference to the PLL. |
| 7:5 | PropCntrl | RW | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full. |
| 4:2 | IntCntrl | RW | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De. |
| 1:0 | Unused | RW |  | Unused. |


13.14.112 10 Bit Programming Register (Lane 3) 


Description 


10 bit programming register 


Register 
R_PciePhyCrLane3PllMeas 


Address 
0xE981119A0 


|  | MeasBias |  |  | Measure copy of bias current in oscillator on atb_force_m. |
|  | MeasVcntrl |  |  | Measure vcntrl on atb_sense_m. If MEAS_VREF is set as well, atb_sense_p,m measu. |
|  | MeasVref |  |  | Measure vref on atb_sense_p; gd on atb_sense_m. If MEAS_VCNTRL is set as well, at. |
|  |  |  |  | Measure vp16 on atb_sense_p; gd on atb_sense_m. |
|  |  |  |  | Measure startup voltage on atb_sense_p; gd on atb_sense_m. |
|  |  |  |  | Measure vco supply voltage on atb_sense_p; gd on atb_sense_m. |
|  | MeasVpCp |  |  | Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m. If MEAS_1V is set as wel. |
|  |  |  |  | Measure 1V supply voltage on atb_sense_m. If MEAS_VP_CP is set as well, atb_sense. |
|  | MeasCrowbar |  |  | Measure crowbar bias voltage on atb_sense_p; gd on atb_sense_m. |


13.14.113 TX ATB Control Register (Set 1) (Lane 3)
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane3TxAnaAtbsel1


Address 


0xE981119A8 


Attributes 


-noregtest 


|  | VbpfSP | RW |  | Vbpf in edge rate control circuit on ATB_S_P. Set ATB_EN to make this useful. |

-e_ | Benswr | 
S| tent? [RW [0 [Tr connected fo ATES? Por terme J 


|  |  |  |  | Txp connected to ATB_S_P. Set ATB_EN to make this useful. |


eee ae Soe ee 
PT [wretsP__ [RW fo id ewe OSOC—SOSSOOCCCC“‘CSC*S 
ro [vesP___[Rw foi Re SSCSC—SSCSCC 


13.14.114 TX ATB Control Register (Set 2) (Lane 3) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane3TxAnaAtbsel2 


Address 


0xE981119B0 


Attributes 
-noregtest 


Definition


|  | AtbEn | RW |  | Connect internal and external ATB busses. Needed for all ATB measurements. |
ies = We Renes <5". 0 Se et oP | 


o_ [Weiss RW] 


Peas PR Vi ER ATEN TT Vem replica on ATB_S_P Set ATB_EN to make this use- 
4 


|  | VbnsSM | RW |  | Vbns in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbpsSP | RW |  | Vbps in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbnfSM | RW |  | Vbnf in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | Enlpbk |  |  | Enable TX external loopback. Make sure internal loopback is not ON. |
| 0 |  | RW | 0 | Enable TX internal loopback. |


13.14.115 TX POWER STATE Control Register (Lane 3) 


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrLane3TxAnaControl 


Address 
0xE981119B8 


Attributes 


-noregtest


|  | FrcPwrst | RW |  | Force power state: tx_en<1:0> input overridden by EN_LCL. |
|  | EnLcl |  |  | Locally force tx_en<1:0>: 00 - power off; 01 - tx idle (slow); 10 - transmit data; 1. |
|  | FrcDo | RW |  | Force Dataovrd locally. When ON, overrides input data_ovrd value. |
|  | DataovrdLcl |  |  | Local dataovrd control value. Set FRC_DO to make this useful. |
|  | FrcBeacon |  |  | Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input an |
| 1 | BcnLcl | RW |  | Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful. |
| 0 | Unused | RW | 0 | Unused reg. |


13.14.116 Transmit Control Inputs Status Register (Lane 4) 
Description 


Status of Transmit control inputs. Reset value depends on inputs.


Register 
R_PciePhyCrLane4TxStat 


Address


0xE98112008

| 15 | Reserved | RS | X | Always reads as. |
| 14:13 | TxEdgerate | RS | X | Edgerate control. |
| 12:10 | TxAtten | RS | X | Attenuation amount control. |
| 9:6 | TxBoost | RS | X | Boost amount control. |
| 5 | Reserved | RS | X | Always reads as. |
|  | TxClkAlign | RS | X | Command to align clocks. |
|  | TxEn | RS | X | Transmit enable control. |
| 0 | TxCkoEn | RS | X | Tx cko clock enable. |


13.14.117 Receiver Control Inputs Status Register (Lane 4) 


Description 


Status of Receiver control inputs. Reset value depends on inputs.


Register 
R_PciePhyCrLane4RxStat 


Address 
0xE98112010 


| 14 | Reserved | RS | X | Always reads as. |
| 13:12 | Los | RS | X | LOS filtering mode control. |
| 11 | DpllReset | RS | X | DPLL reset control. |
| 10:8 | RxDpllMode | RS | X | DPLL mode control. |
| 7:5 | RxEqVal | RS | X | Equalization amount control. |
| 4 | RxTermEn | RS | X | Receiver termination enable. |
| 3 | RxAlignEn | RS | X | Receiver alignment enable. |
| 2 | RxEn | RS | X | Receiver enable control. |
| 1 | RxPllPwron | RS | X | PLL power state control. |
| 0 | HalfRate | RS | X | Digital half-rate data control. |


13.14.118 Output Signals Status Register (Lane 4) 


Description 


Status of output signals. Reset value depends on inputs.


Register 
R_PciePhyCrLane4OutStat 


Address 
0xE98112018 


Always reads a 

Transmit receiver detection result. 

|__| Transmit operation is complete output. 
2 R Loss 31 j 


p2 [tos J 
PoP Rava TRS] 


13.14.119 Transmitter Control Inputs Override Register (Lane 4)
Description 


Override of Transmitter control inputs 


Register 


R_PciePhyCrLane4TxOvrd 


Address 


0xE98112020 

| 15 | Ovrd | RWS | 0 | Enable override of all bits in this register. |
| 14:13 | TxEdgerate | RW | 0x0 | Edgerate control. |
| 12:10 | TxAtten | RW | 0x0 | Attenuation amount control. |
| 9:6 | TxBoost | RW | 0x0 | Boost amount control. |
| 5 | Reserved | RW | 0 | No effect. |
|  | TxClkAlign | RW | 0 | Command to align clocks. |
|  | TxEn | RW | 0x3 | Transmit enable control. |
| 0 | TxCkoEn | RW | 1 | Tx cko clock enable. |


13.14.120 Receiver Control Inputs Override Register (Lane 4) 


Description 


Override of Receiver control inputs 


Register 
R_PciePhyCrLane4RxOvrd 


Address 
0xE98112028 


| 14 | Ovrd | RWS | 0 | Enable override of all bits in this register. |
| 13:12 | Los | RW |  | LOS filtering mode control. |
| 11 | DpllReset | RW | 0 | DPLL reset control. |
| 10:8 | RxDpllMode | RW |  | DPLL mode control. |
| 7:5 | RxEqVal | RW | 0x0 | Equalization amount control. |
| 4 | RxTermEn | RW | 1 | Receiver termination enable. |
| 3 | RxAlignEn | RW |  | Receiver alignment enable. |
| 2 | RxEn | RW |  | Receiver enable control. |
| 1 | RxPllPwron | RW | 1 | PLL power state control. |
| 0 | HalfRate | RW | 0 | Digital half-rate data control. |


13.14.121 Output Signals Override Register (Lane 4) 
Description 


Override of output signals 


Register 
R_PciePhyCrLane4OutOvrd 


Address


0xE98112030


| 5 | Ovrd | RWS | 0 | Enable override of all bits in this register. |
| 4 | TxRxpres | RW | 1 | Transmit receiver detection result. |
| 3 | TxDone | RW | 0 | Transmit operation is complete output. |
| 2 | Los | RW | 0 | Loss of signal output. |
| 1 | RxPllState | RW | 0 | Current state of Rx PLL. |
| 0 | RxValid | RW | 1 | Receiver valid output. |


13.14.122 Debug Control Register (Lane 4) 


Description 


Debug control register 


Register 
R_PciePhyCrLane4DbgCtl 


Address 
0xE98112038


| 14:10 | DtbSel1 | RW |  | Select of wire to go onto DTB bit 1: 0 - disabled; 1 - |
| 9:5 | DtbSel0 | RW |  | Select of wire to go onto DTB bit 0: 0 - disabled; 1 - |
| 4 | DisableRxCk | RW | 0 | Disable rx_ck output. |
| 3 | InvertRx | RW | 0 | Invert receive data (pre-lbert). |
| 2 | InvertTx | RW | 0 | Invert transmit data (post-lbert). |
| 1 | ZeroRxData | RW | 0 | Override all receive data to zeros. |
| 0 | ZeroTxData | RW | 0 | Override all transmit data to zeros. |


13.14.123 Pattern Generator Controls Register (Lane 4) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrLane4PgCtl 


Address 
0xE98112080 


|  | Pat | RW | 0x0 | Pattern for modes |
| 3 | Trigger | RW | 0 | Insert a single error into a lsb. |
| 2:0 | Mode | RW | 0x0 | Pattern to generate: 0 - disabled; 1 - lfsr15. |


13.14.124 Pattern Matcher Controls Register (Lane 4) 
Description 


Pattern Matcher controls 


Register
R_PciePhyCrLane4PmCtl


Address
0xE981120C0


|  | Sync | RW |  | Synchronize pattern matcher LFSR with incoming data; must be turned on then off t. |
|  | Mode | RW |  | Pattern to match: 0 - disabled; 1 - lfsr15; 2 - lfsr7; 3 - d[n] = d[n-10]; 4 - d[n] = |


13.14.125 Pattern Match Error Counter Register (Lane 4) 


Description 

Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.

Register 

R_PciePhyCrLane4PmErr 


Address 
0xE981120C8 


Attributes 


-noregtest 


| 15 | Ov14 | RWS | X | If active, multiply COUNT by 128. |
| 14:0 | Count | RWS | X | Current error count. If OV14 field is active, then multiply count by 128. |


13.14.126 Current Phase Selector Value. Register (Lane 4) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrLane4Phase


Address 
0xE981120D0 


Attributes 


-noregtest 


|  | Val | RWS | 0x0 | Current phase selector value. |
| 0 |  | RWS | 0 | Current phase selector value. |


13.14.127 Current Frequency Integrator Value. Register (Lane 4)
Description 


Current frequency integrator value. 


Register 


R_PciePhyCrLane4Freq 


Address 


0xE981120D8 


Attributes 


-noregtest 


|  | Val | RWS | 0x0 | Current frequency integrator value. |
| 0 |  | RWS | 0 | Current frequency integrator value. |


13.14.128 Scope Control Register (Lane 4) 
Description 


Control bits for per-transceiver scope portion 


Register 


R_PciePhyCrLane4ScopeCtl 


Address 


0xE981120E0 


| 14:11 | Base | RW | 0x0 | Which bit to sample when MODE = 1. |
| 10:2 | Delay | RW | 0x0 | Number of symbols to skip between samples. |
| 1:0 | Mode | RW | 0x0 | Mode of counters: 0 = off; 1 = sample every 10 bits (see BASE); 2 = sample every 11. |


13.14.129 Recovered Domain Receiver Control Register (Lane 4) 


Description 


Control bits for receiver in recovered domain 


Register 


R_PciePhyCrLane4RxCtl 


Address 


0xE981120E8 




|  | SwitchVal | RW | 0 | Value to override the data/phase mux. |
|  | OvrdSwitch |  |  | Override the value of the data/phase mux. |
| 12:10 | ModeBp | RW |  | Set BP 2:0 to longer timescale (for FTS patterns). BP0 - Start PHUG profile at 4/. |
| 9:8 | FrugValue | RW | 0x0 | Override value for FRUG. |
| 7:6 | PhugValue | RW | 0x0 | Override value for PHUG. |
| 5 | OvrdDpllGain | RW | 0 | Override PHUG and FRUG values. |
| 4 | PhdetPol | RW | 0 | Reverse polarity of phase error. |
| 3:2 | PhdetEdge | RW |  | Edges to use for phase detection: top bit is rising edges, bottom is falling. |
| 1:0 | PhdetEn |  |  | Enable phase detector: top bit is odd slicers, bottom is even. |


13.14.130 Receiver Debug Register (Lane 4) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrLane4RxDbg 


Address 
0xE981120F0 


|  | DtbSel1 | RW | 0x0 | Select wire to go on DTB bit 1. |
|  | DtbSel0 | RW | 0x0 | Select wire to go on DTB bit 0. |


13.14.131 RX Control Register (Lane 4) 
Description 


RX Control Bits 


Register 


R_PciePhyCrLane4RxAnaCtrl 


Address 
0xE98112180 


Attributes 


-noregtest 


Se a 

}4 | RxlbiEfn |RW [0 ~~ [|__| Digital serial (internal) loopback enable bit. 

[3 [Reba | RW—| 0 || Wafer level (external) loopback enable bit. 

AOE [0 ek 95 enable bie 
a 
POF nate 


13.14.132 RX ATB Register (Lane 4)
Description 


RX ATB bits 


Register 
R_PciePhyCrLane4RxAnaAtb 


Address 
0xE98112188 


Attributes 


-noregtest 


| 5 | SensemVrefLos | RW | 0 | Connect atb_sm to vref_los (vrefrx/14). |
| 4 | SensemVcm | RW | 0 | Connect atb_sm to RX vcm. |
| 3 | SensemRxM | RW | 0 | Connect atb_sm to rx_m. |
| 2 | SensepRxP | RW | 0 | Connect atb_sp to rx_p. |
| 1 | ForcepRxM | RW | 0 | Connect atb_fp to rx_m. |
| 0 | ForcepRxP | RW | 0 | Connect atb_fp to rx_p. |


13.14.133 8 Bit Programming Register (Lane 4)
Description 


8 bit programming register 


Register 
R_PciePhyCrLane4PllPrg2 


Address 
0xE98112190 


Attributes 


-noregtest


[Denton ——s—C—~“*S*S*SC“‘“S*S*S*SCS*S 
AtbSenseSel RW Control of ——— charge pump current 1=Enable 
signals internal to the PLL. 
FrceHcpl RW Allow override of default value of hep] 1=allow hcpl_lcl to 
ae high-couplin. 


aa HeplLcl ee | 1=force coupling in vco to maximum. ssid 


FrcPwron |_| ie override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


| 3 | Pwronkcl | 10 =| ‘| 1=power is supplied to the PLL. 


Se ocr Allow override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


[Reeth RW [0 [___[ T= PEL is held placed in reset 


P| EnableTestPd pee 1=phase linearity of phase interpolator and VCO is being 
tested. 




13.14.134 10 Bit Programming Register (Lane 4) 
Description 


10 bit programming register 


Register 
R_PciePhyCrLane4PllPrg1


Address 
0xE98112198 


Attributes 


-noregtest 


| 9 | Unused | RW | 1 | Unused. |
| 8 | SelRxck | RW | 0 | Use recovered clock as reference to the PLL. |
| 7:5 | PropCntrl | RW | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full. |
| 4:2 | IntCntrl | RW | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De. |
| 1:0 | Unused | RW |  | Unused. |


13.14.135 10 Bit Programming Register (Lane 4) 


Description 


10 bit programming register 


Register 
R_PciePhyCrLane4PllMeas 


Address 
0xE981121A0 


|  | MeasBias |  |  | Measure copy of bias current in oscillator on atb_force_m. |
|  | MeasVcntrl |  |  | Measure vcntrl on atb_sense_m. If MEAS_VREF is set as well, atb_sense_p,m measu. |
|  | MeasVref |  |  | Measure vref on atb_sense_p; gd on atb_sense_m. If MEAS_VCNTRL is set as well, at. |
|  |  |  |  | Measure vp16 on atb_sense_p; gd on atb_sense_m. |
|  |  |  |  | Measure startup voltage on atb_sense_p; gd on atb_sense_m. |
|  |  |  |  | Measure vco supply voltage on atb_sense_p; gd on atb_sense_m. |
|  | MeasVpCp |  |  | Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m. If MEAS_1V is set as wel. |
|  |  |  |  | Measure 1V supply voltage on atb_sense_m. If MEAS_VP_CP is set as well, atb_sense. |
|  | MeasCrowbar |  |  | Measure crowbar bias voltage on atb_sense_p; gd on atb_sense_m. |


13.14.136 TX ATB Control Register (Set 1) (Lane 4)
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane4TxAnaAtbsel1


Address 


0xE981121A8 


Attributes 


-noregtest 


|  | VbpfSP | RW |  | Vbpf in edge rate control circuit on ATB_S_P. Set ATB_EN to make this useful. |

-e_ | Benswr | 
S| tent? [RW [0 [Tr connected fo ATES? Por terme J 


|  |  |  |  | Txp connected to ATB_S_P. Set ATB_EN to make this useful. |


eee ae Soe ee 
PT [wretsP__ [RW fo id ewe OSOC—SOSSOOCCCC“‘CSC*S 
ro [vesP___[Rw foi Re SSCSC—SSCSCC 


13.14.137 TX ATB Control Register (Set 2) (Lane 4) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane4TxAnaAtbsel2 


Address 


0xE981121B0 


Attributes 
-noregtest 


Definition


|  | AtbEn | RW |  | Connect internal and external ATB busses. Needed for all ATB measurements. |
ies = We Renes <5". 0 Se et oP | 


o_ [Weiss RW] 


Peas PR Vi ER ATEN TT Vem replica on ATB_S_P Set ATB_EN to make this use- 
4 


|  | VbnsSM | RW |  | Vbns in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbpsSP | RW |  | Vbps in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | VbnfSM | RW |  | Vbnf in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful. |
|  | Enlpbk |  |  | Enable TX external loopback. Make sure internal loopback is not ON. |
| 0 |  | RW | 0 | Enable TX internal loopback. |


13.14.138 TX POWER STATE Control Register (Lane 4)


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrLane4TxAnaControl 


Address 
0xE981121B8 


Attributes 


-noregtest


Definition

FrcPwrst RW Force power state. tx_en<1:0> input overridden by EN_LCL.

EnLcl Locally force tx_en<1:0>. 00 - power off 01 - tx idle (slow) 10 - transmit data 1.

FrcDo RW Force Dataovrd locally. When ON, overrides input data_ovrd value.

DataovrdLcl Local dataovrd control value. Set FRC_DO to make this useful.

FrcBeacon Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input an

1 BcnLcl RW Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful.

0 Unused RW 0 Unused reg


13.14.139 Transmit Control Inputs Status Register (Lane 5) 
Description 


Status of Transmit control inputs Reset value depends on inputs 


Register 
R_PciePhyCrLane5TxStat 
Address


0xE98112808

Bits | Mnemonic | Access | Reset | Definition
15 | Reserved | RS | X | Always reads as.
14:13 | TxEdgerate | RS | X | Edgerate control.
12:10 | TxAtten | RS | X | Attenuation amount control.
9:6 | TxBoost | RS | X | Boost amount control.
5 | Reserved | RS | X | Always reads as.
4 | TxClkAlign | RS | X | Command to align clocks.
3:1 | TxEn | RS | X | Transmit enable control.
0 | TxCkoEn | RS | X | Tx_cko clock enable.


13.14.140 Receiver Control Inputs Status Register (Lane 5) 


Description 


Status of Receiver control inputs Reset value depends on inputs 


Register 
R_PciePhyCrLane5RxStat 


Address 
0xE98112810 


Bits | Mnemonic | Access | Reset | Definition
14 | Reserved | RS | X | Always reads as.
13:12 | Los | RS | X | LOS filtering mode control.
11 | DpllReset | RS | X | DPLL reset control.
10:8 | RxDpllMode | RS | X | DPLL mode control.
7:5 | RxEqVal | RS | X | Equalization amount control.
4 | RxTermEn | RS | X | Receiver termination enable.
3 | RxAlignEn | RS | X | Receiver alignment enable.
2 | RxEn | RS | X | Receiver enable control.
1 | RxPllPwron | RS | X | PLL power state control.
0 | HalfRate | RS | X | Digital half-rate data control.


13.14.141 Output Signals Status Register (Lane 5) 


Description 


Status of output signals Reset value depends on inputs 


Register 
R_PciePhyCrLane5OutStat 


Address 
0xE98112818 


Bits | Mnemonic | Access | Reset | Definition
5 | Reserved | RS | X | Always reads as.
4 | TxRxpres | RS | X | Transmit receiver detection result.
3 | TxDone | RS | X | Transmit operation is complete output.
2 | Los | RS | X | Loss of signal output.
1 | RxPllState | RS | X | Current state of Rx PLL.
0 | RxValid | RS | X | Receiver valid output.


13.14.142 Transmitter Control Inputs Override Register (Lane 5) 
Description 


Override of Transmitter control inputs 


Register 
R_PciePhyCrLane5TxOvrd 


Address 


0xE98112820 

Bits | Mnemonic | Access | Reset | Definition
15 | Ovrd | RWS | 0 | Enable override of all bits in this register.
14:13 | TxEdgerate | RW | 0x0 | Edgerate control.
12:10 | TxAtten | RW | 0x0 | Attenuation amount control.
9:6 | TxBoost | RW | 0x0 | Boost amount control.
5 | Reserved | RW | 0 | No effect.
4 | TxClkAlign | RW | 0 | Command to align clocks.
3:1 | TxEn | RW | 0x3 | Transmit enable control.
0 | TxCkoEn | RW | 1 | Tx_cko clock enable.
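
As an illustration only, the following Python sketch packs field values into a 16-bit word for R_PciePhyCrLane5TxOvrd. The bit positions used (Ovrd at 15, TxEdgerate at 14:13, TxAtten at 12:10, TxBoost at 9:6, TxClkAlign at 4, TxEn at 3:1, TxCkoEn at 0) are a reading of the table above and should be confirmed against the hardware. The names TX_OVRD_FIELDS and pack_tx_ovrd are hypothetical and are not part of any SiCortex software.

```python
# Field layout as (lsb, width). Assumption: positions are a reading of the
# register table in this section, not values taken from SiCortex software.
TX_OVRD_FIELDS = {
    "Ovrd": (15, 1),
    "TxEdgerate": (13, 2),
    "TxAtten": (10, 3),
    "TxBoost": (6, 4),
    "TxClkAlign": (4, 1),
    "TxEn": (1, 3),
    "TxCkoEn": (0, 1),
}


def pack_tx_ovrd(**fields: int) -> int:
    """Pack named field values into a 16-bit register word (hypothetical helper)."""
    word = 0
    for name, value in fields.items():
        lsb, width = TX_OVRD_FIELDS[name]  # KeyError for an unknown field name
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        word |= value << lsb
    return word


# Enable the override, request maximum boost, set TxEn to 0x3 and TxCkoEn to 1.
word = pack_tx_ovrd(Ovrd=1, TxBoost=0xF, TxEn=0x3, TxCkoEn=1)
print(hex(word))  # 0x83c7
```

Fields that are not named are left at zero, so a caller that wants to preserve a field must pass its value explicitly.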


13.14.143 Receiver Control Inputs Override Register (Lane 5) 


Description 


Override of Receiver control inputs 


Register 
R_PciePhyCrLane5RxOvrd 


Address 
0xE98112828 


Bits | Mnemonic | Access | Reset | Definition
14 | Ovrd | RWS | 0 | Enable override of all bits in this register.
13:12 | Los | RW | | LOS filtering mode control.
11 | DpllReset | RW | 0 | DPLL reset control.
10:8 | RxDpllMode | RW | | DPLL mode control.
7:5 | RxEqVal | RW | 0x0 | Equalization amount control.
4 | RxTermEn | RW | 1 | Receiver termination enable.
3 | RxAlignEn | RW | | Receiver alignment enable.
2 | RxEn | RW | | Receiver enable control.
1 | RxPllPwron | RW | 1 | PLL power state control.
0 | HalfRate | RW | 0 | Digital half-rate data control.


13.14.144 Output Signals Override Register (Lane 5) 
Description 


Override of output signals 


Register 
R_PciePhyCrLane5OutOvrd 
Address


0xE98112830 


Bits | Mnemonic | Access | Reset | Definition
5 | Ovrd | RWS | 0 | Enable override of all bits in this register.
4 | TxRxpres | RW | 1 | Transmit receiver detection result.
3 | TxDone | RW | 0 | Transmit operation is complete output.
2 | Los | RW | 0 | Loss of signal output.
1 | RxPllState | RW | 0 | Current state of Rx PLL.
0 | RxValid | RW | 1 | Receiver valid output.


13.14.145 Debug Control Register (Lane 5) 


Description 


Debug control register 


Register 
R_PciePhyCrLane5DbgCtl 


Address 
0xE98112838


Bits | Mnemonic | Access | Reset | Definition
14:10 | DtbSel1 | RW | | Select of wire to go onto DTB bit 1. 0 - disabled 1 -
9:5 | DtbSel0 | RW | | Select of wire to go onto DTB bit 0. 0 - disabled 1 -
4 | DisableRxCk | RW | 0 | Disable rx_ck output.
3 | InvertRx | RW | 0 | Invert receive data (pre-lbert).
2 | InvertTx | RW | 0 | Invert transmit data (post-lbert).
1 | ZeroRxData | RW | 0 | Override all receive data to zeros.
0 | ZeroTxData | RW | 0 | Override all transmit data to zeros.


13.14.146 Pattern Generator Controls Register (Lane 5) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrLane5PgCtl 


Address 
0xE98112880 


Bits | Mnemonic | Access | Reset | Definition
 | Pat | RW | 0x0 | Pattern for modes
3 | Trigger | RW | 0 | Insert a single error into a lsb
2:0 | Mode | RW | 0x0 | Pattern to generate. 0 - disabled 1 - lfsr15.


13.14.147 Pattern Matcher Controls Register (Lane 5) 
Description 


Pattern Matcher controls 
Register
R_PciePhyCrLane5PmCtl 


Address 
0xE981128C0


Definition

Sync RW Synchronize pattern matcher LFSR with incoming data. Must be turned on then off t.

Mode RW Pattern to match. 0 - disabled 1 - lfsr15 2 - lfsr7 3 - d[n] = d[n-10] 4 - d[n] =


13.14.148 Pattern Match Error Counter Register (Lane 5) 


Description 

Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.
Register 

R_PciePhyCrLane5PmErr 


Address 
0xE981128C8 


Attributes 


-noregtest 


Bits | Mnemonic | Access | Reset | Definition
15 | Ov14 | RWS | X | If active, multiply COUNT by 128.
14:0 | Count | RWS | X | Current error count. If OV14 field is active, then multiply count by 128.
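
The scaling rule for the error counter can be sketched as follows. This is a minimal Python example, assuming the raw value is a 16-bit word with the OV14 flag in bit 15 and COUNT in bits 14:0, as the table above indicates. The name decode_pm_err is hypothetical. Because a read resets the register, the raw value should be read once and decoded from the saved copy.

```python
OV14_BIT = 15        # assumed position of the OV14 overflow flag
COUNT_MASK = 0x7FFF  # bits 14:0 hold COUNT


def decode_pm_err(raw: int) -> int:
    """Return the error count represented by one raw 16-bit read (hypothetical helper)."""
    count = raw & COUNT_MASK
    if (raw >> OV14_BIT) & 1:
        # OV14 active: the stored count is in units of 128 errors.
        count *= 128
    return count


print(decode_pm_err(0x0005))  # 5
print(decode_pm_err(0x8005))  # 640
```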


13.14.149 Current Phase Selector Value. Register (Lane 5) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrLane5Phase 


Address 
0xE981128D0 


Attributes 


-noregtest 


Bits | Mnemonic | Access | Reset | Definition
10:1 | Val | RWS | 0x0 | Current phase selector value.
0 | Dthr | RWS | 0 | Current phase selector value.

13.14.150 Current Frequency Integrator Value. Register (Lane 5)
Description 


Current frequency integrator value. 


Register 


R_PciePhyCrLane5Freq 


Address 


0xE981128D8 


Attributes 


-noregtest 


Bits | Mnemonic | Access | Reset | Definition
13:1 | Val | RWS | 0x0 | Current frequency integrator value.
0 | Dthr | RWS | 0 | Current frequency integrator value.


13.14.151 Scope Control Register (Lane 5) 
Description 


Control bits for per-transceiver scope portion 


Register 


R_PciePhyCrLane5ScopeCtl 


Address 


0xE981128E0 


Bits | Mnemonic | Access | Reset | Definition
14:11 | Base | RW | 0x0 | Which bit to sample when MODE = 1.
10:2 | Delay | RW | 0x0 | Number of symbols to skip between samples.
1:0 | Mode | RW | 0x0 | Mode of counters. 0 = off 1 = sample every 10 bits (see BASE) 2 = sample every 11.
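
The field layout of the scope control register (Base in bits 14:11, Delay in bits 10:2, Mode in bits 1:0) can be exercised with a short Python sketch. The function names pack_scope_ctl and unpack_scope_ctl are hypothetical, chosen only for this example.

```python
def pack_scope_ctl(base: int, delay: int, mode: int) -> int:
    """Build a ScopeCtl word from its three fields (hypothetical helper)."""
    if not 0 <= base <= 0xF:
        raise ValueError("Base is a 4-bit field")
    if not 0 <= delay <= 0x1FF:
        raise ValueError("Delay is a 9-bit field")
    if not 0 <= mode <= 0x3:
        raise ValueError("Mode is a 2-bit field")
    return (base << 11) | (delay << 2) | mode


def unpack_scope_ctl(raw: int) -> dict:
    """Split a ScopeCtl word back into its fields."""
    return {
        "base": (raw >> 11) & 0xF,
        "delay": (raw >> 2) & 0x1FF,
        "mode": raw & 0x3,
    }


# Sample bit 3 of every 10-bit symbol, skipping 5 symbols between samples.
word = pack_scope_ctl(base=3, delay=5, mode=1)
print(hex(word))  # 0x1815
```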


13.14.152 Recovered Domain Receiver Control Register (Lane 5) 


Description 


Control bits for receiver in recovered domain 


Register 


R_PciePhyCrLane5RxCtl 


Address 


0xE981128E8 

Bits | Mnemonic | Access | Reset | Definition
 | SwitchVal | RW | 0 | Value to override the data/phase mux.
 | OvrdSwitch | | | Override the value of the data/phase mux.
12:10 | ModeBp | RW | | Set BP 2:0 to longer timescale (for FTS patterns). BP0 - Start PHUG profile at 4/.
9:8 | FrugValue | RW | 0x0 | Override value for FRUG.
7:6 | PhugValue | RW | 0x0 | Override value for PHUG.
 | OvrdDpllGain | RW | 0 | Override PHUG and FRUG values.
 | PhdetPol | RW | 0 | Reverse polarity of phase error.
3:2 | PhdetEdge | RW | | Edges to use for phase detection. Top bit is rising edges, bottom is falling.
 | PhdetEn | | | Enable phase detector. Top bit is odd slicers, bottom is even.


13.14.153 Receiver Debug Register (Lane 5) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrLane5RxDbg 


Address 
0xE981128F0 


DtbSel1 RW 0x0 Select wire to go on DTB bit 1.

DtbSel0 RW 0x0 Select wire to go on DTB bit 0.


13.14.154 RX Control Register (Lane 5) 
Description 


RX Control Bits 


Register 
R_PciePhyCrLane5RxAnaCtrl 


Address 
0xE98112980 


Attributes 


-noregtest 


Se a 

4 | RxLbiEn | RW | 0 | Digital serial (internal) loopback enable bit.

3 | | RW | 0 | Wafer level (external) loopback enable bit.

AOE [0 ek 95 enable bie 
a 
POF nate 
13.14.155 RX ATB Register (Lane 5)
Description 


RX ATB bits 


Register 
R_PciePhyCrLane5RxAnaAtb 


Address 
0xE98112988 


Attributes 


-noregtest 

Bits | Mnemonic | Access | Reset | Definition
5 | SensemVrefLos | RW | 0 | Connect atb_sm to vref_los (vref_rx/14).
4 | SensemVcm | RW | 0 | Connect atb_sm to RX vcm.
3 | SensemRxM | RW | 0 | Connect atb_sm to rx_m.
2 | SensepRxP | RW | 0 | Connect atb_sp to rx_p.
1 | ForcepRxM | RW | 0 | Connect atb_fp to rx_m.
0 | ForcepRxP | RW | 0 | Connect atb_fp to rx_p.


13.14.156 8 Bit Programming Register (Lane 5) 
Description 


8 bit programming register 


Register 
R_PciePhyCrLane5PllPrg2 


Address 
0xE98112990 


Attributes 


-noregtest


[Denton ——s—C—~“*S*S*SC“‘“S*S*S*SCS*S 
AtbSenseSel RW Control of ——— charge pump current 1=Enable 
signals internal to the PLL. 
FrcHcpl RW Allow override of default value of hcpl. 1=allow hcpl_lcl to control high-couplin.

HcplLcl 1=force coupling in vco to maximum.

FrcPwron Allow override of default value of pll_pwron. 1=allow pwron_lcl to control pll po.

3 PwronLcl 1=power is supplied to the PLL.

Allow override of default value of pll_pwron. 1=allow pwron_lcl to control pll po.

RW 0 1=PLL is held in reset.

EnableTestPd 1=phase linearity of phase interpolator and VCO is being tested.


13.14.157 10 Bit Programming Register (Lane 5)
Description 


10 bit programming register 


Register 
R_PciePhyCrLane5PllPrg1


Address 
0xE98112998 


Attributes 


-noregtest 


Bits | Mnemonic | Access | Reset | Definition
 | Unused | RW | 1 | Unused.
 | SelRxck | RW | 0 | Use recovered clock as reference to the PLL.
7:5 | PropCntrl | RW | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full.
4:2 | IntCntrl | RW | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De.
1:0 | Unused | RW | | Unused.
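
The charge pump settings can be worked through numerically. The Python sketch below assumes that both fields follow the relation current = (n+1)/8 of full scale (the table states this for the integral current; the proportional entry is truncated and is assumed to follow the same form), with PropCntrl in bits 7:5 and IntCntrl in bits 4:2. The function names are hypothetical.

```python
from fractions import Fraction


def charge_pump_fraction(n: int) -> Fraction:
    """Fraction of full-scale current selected by a 3-bit control value n."""
    if not 0 <= n <= 7:
        raise ValueError("control value must fit in 3 bits")
    return Fraction(n + 1, 8)


def decode_pll_prg1(raw: int) -> tuple:
    """Return (proportional, integral) fractions of full scale for a raw word."""
    prop_cntrl = (raw >> 5) & 0x7  # bits 7:5
    int_cntrl = (raw >> 2) & 0x7   # bits 4:2
    return charge_pump_fraction(prop_cntrl), charge_pump_fraction(int_cntrl)


# PropCntrl = 0x5 and IntCntrl = 0x2, the reset values listed for these fields.
prop, integ = decode_pll_prg1((0x5 << 5) | (0x2 << 2))
print(prop, integ)  # 3/4 3/8
```

Under this assumption, n = 7 selects the full-scale current and n = 0 selects one eighth of it, so the current can never be programmed to zero through these fields.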


13.14.158 10 Bit Programming Register (Lane 5) 


Description 


10 bit programming register 


Register 
R_PciePhyCrLane5PllMeas 


Address 
0xE981129A0 


Mnemonic

Measure copy of bias current in oscillator on atb_force_m.

MeasVcntrl Measure vcntrl on atb_sense_m If MEAS_VREF is set as well, atb_sense_p,m measu.

MeasVref Measure vref on atb_sense_p; gd on atb_sense_m If MEAS_VCNTRL is set as well, at.

Measure vp16 on atb_sense_p; gd on atb_sense_m.

Measure startup voltage on atb_sense_p; gd on atb_sense_m.

Measure vco supply voltage on atb_sense_p; gd on atb_sense_m.

MeasVpCp Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m If MEAS_1V is set as wel.

Measure 1V supply voltage on atb_sense_m If MEAS_VP_CP is set as well, atb_sense.

MeasCrowbar Measure crowbar bias voltage on atb_sense_p; gd on atb_sense_m.


13.14.159 TX ATB Control Register (Set 1) (Lane 5)
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane5TxAnaAtbsel1


Address 


0xE981129A8 


Attributes 


-noregtest 


VbpfSP RW Vbpf in edge rate control circuit on ATB_S_P. Set ATB_EN to make this useful.

-e_ | Benswr | 
S| tent? [RW [0 [Tr connected fo ATES? Por terme J 


Txp connected to ATB_S_P. Set ATB_EN to make this useful.


eee ae Soe ee 
PT [wretsP__ [RW fo id ewe OSOC—SOSSOOCCCC“‘CSC*S 
ro [vesP___[Rw foi Re SSCSC—SSCSCC 


13.14.160 TX ATB Control Register (Set 2) (Lane 5) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane5TxAnaAtbsel2 


Address 


0xE981129B0 


Attributes 
-noregtest 
Definition


AtbEn RW Connect internal and external ATB busses. Needed for all ATB measurements.
ies = We Renes <5". 0 Se et oP | 


o_ [Weiss RW] 


Vcm replica on ATB_S_P. Set ATB_EN to make this useful.

VbnsSM RW Vbns in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful.

VbpsSP RW Vbps in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful.

VbnfSM RW Vbnf in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful.


Enlpbk RW Enable TX external loopback. Make sure internal loopback is not ON.

0 RW 0 Enable TX internal loopback.


13.14.161 TX POWER STATE Control Register (Lane 5) 


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrLane5TxAnaControl 


Address 
0xE981129B8 


Attributes 


-noregtest


Definition

FrcPwrst RW Force power state. tx_en<1:0> input overridden by EN_LCL.

EnLcl Locally force tx_en<1:0>. 00 - power off 01 - tx idle (slow) 10 - transmit data 1.

FrcDo RW Force Dataovrd locally. When ON, overrides input data_ovrd value.

DataovrdLcl Local dataovrd control value. Set FRC_DO to make this useful.

FrcBeacon Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input an

1 BcnLcl RW Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful.

0 Unused RW 0 Unused reg
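
The per-lane register addresses listed in this section are regularly spaced: the same register in adjacent lanes differs by 0x800 (for example, R_PciePhyCrLane4TxAnaControl at 0xE981121B8 and R_PciePhyCrLane5TxAnaControl at 0xE981129B8). The Python sketch below uses that observation to compute an address from a lane number and a register name. The base, stride, and offsets are derived from the addresses listed for lanes 4 through 7 only; the names LANE5_BASE, REG_OFFSETS, and lane_reg_addr are hypothetical, and the sketch deliberately refuses lanes it has no listed addresses for.

```python
LANE5_BASE = 0xE98112800  # derived from the Lane 5 addresses in this section
LANE_STRIDE = 0x800       # spacing between the same register in adjacent lanes

# Offsets within one lane, taken from the Lane 5 register addresses.
REG_OFFSETS = {
    "TxStat": 0x008, "RxStat": 0x010, "OutStat": 0x018,
    "TxOvrd": 0x020, "RxOvrd": 0x028, "OutOvrd": 0x030,
    "DbgCtl": 0x038, "PgCtl": 0x080, "PmCtl": 0x0C0,
    "PmErr": 0x0C8, "Phase": 0x0D0, "Freq": 0x0D8,
    "ScopeCtl": 0x0E0, "RxCtl": 0x0E8, "RxDbg": 0x0F0,
    "RxAnaCtrl": 0x180, "RxAnaAtb": 0x188, "PllPrg2": 0x190,
    "PllPrg1": 0x198, "PllMeas": 0x1A0, "TxAnaAtbsel1": 0x1A8,
    "TxAnaAtbsel2": 0x1B0, "TxAnaControl": 0x1B8,
}


def lane_reg_addr(lane: int, reg: str) -> int:
    """Address of a per-lane PHY register (hypothetical helper, lanes 4-7 only)."""
    if lane not in range(4, 8):
        raise ValueError("only lanes 4 through 7 are covered by this sketch")
    return LANE5_BASE + (lane - 5) * LANE_STRIDE + REG_OFFSETS[reg]


print(hex(lane_reg_addr(6, "TxStat")))  # 0xe98113008
```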


13.14.162 Transmit Control Inputs Status Register (Lane 6) 
Description 


Status of Transmit control inputs Reset value depends on inputs 


Register 
R_PciePhyCrLane6TxStat 
Address


0xE98113008 

Bits | Mnemonic | Access | Reset | Definition
15 | Reserved | RS | X | Always reads as.
14:13 | TxEdgerate | RS | X | Edgerate control.
12:10 | TxAtten | RS | X | Attenuation amount control.
9:6 | TxBoost | RS | X | Boost amount control.
5 | Reserved | RS | X | Always reads as.
4 | TxClkAlign | RS | X | Command to align clocks.
3:1 | TxEn | RS | X | Transmit enable control.
0 | TxCkoEn | RS | X | Tx_cko clock enable.


13.14.163 Receiver Control Inputs Status Register (Lane 6) 
Description 


Status of Receiver control inputs Reset value depends on inputs 


Register 
R_PciePhyCrLane6RxStat 


Address 
0xE98113010 


Bits | Mnemonic | Access | Reset | Definition
14 | Reserved | RS | X | Always reads as.
13:12 | Los | RS | X | LOS filtering mode control.
11 | DpllReset | RS | X | DPLL reset control.
10:8 | RxDpllMode | RS | X | DPLL mode control.
7:5 | RxEqVal | RS | X | Equalization amount control.
4 | RxTermEn | RS | X | Receiver termination enable.
3 | RxAlignEn | RS | X | Receiver alignment enable.
2 | RxEn | RS | X | Receiver enable control.
1 | RxPllPwron | RS | X | PLL power state control.
0 | HalfRate | RS | X | Digital half-rate data control.


13.14.164 Output Signals Status Register (Lane 6) 


Description 


Status of output signals Reset value depends on inputs 


Register 
R_PciePhyCrLane6OutStat 


Address 
0xE98113018 


Bits | Mnemonic | Access | Reset | Definition
5 | Reserved | RS | X | Always reads as.
4 | TxRxpres | RS | X | Transmit receiver detection result.
3 | TxDone | RS | X | Transmit operation is complete output.
2 | Los | RS | X | Loss of signal output.
1 | RxPllState | RS | X | Current state of Rx PLL.
0 | RxValid | RS | X | Receiver valid output.


13.14.165 Transmitter Control Inputs Override Register (Lane 6)
Description 


Override of Transmitter control inputs 


Register 
R_PciePhyCrLane6TxOvrd 


Address 


0xE98113020 

Bits | Mnemonic | Access | Reset | Definition
15 | Ovrd | RWS | 0 | Enable override of all bits in this register.
14:13 | TxEdgerate | RW | 0x0 | Edgerate control.
12:10 | TxAtten | RW | 0x0 | Attenuation amount control.
9:6 | TxBoost | RW | 0x0 | Boost amount control.
5 | Reserved | RW | 0 | No effect.
4 | TxClkAlign | RW | 0 | Command to align clocks.
3:1 | TxEn | RW | 0x3 | Transmit enable control.
0 | TxCkoEn | RW | 1 | Tx_cko clock enable.


13.14.166 Receiver Control Inputs Override Register (Lane 6) 


Description 


Override of Receiver control inputs 


Register 
R_PciePhyCrLane6RxOvrd 


Address 
0xE98113028 


Bits | Mnemonic | Access | Reset | Definition
14 | Ovrd | RWS | 0 | Enable override of all bits in this register.
13:12 | Los | RW | | LOS filtering mode control.
11 | DpllReset | RW | 0 | DPLL reset control.
10:8 | RxDpllMode | RW | | DPLL mode control.
7:5 | RxEqVal | RW | 0x0 | Equalization amount control.
4 | RxTermEn | RW | 1 | Receiver termination enable.
3 | RxAlignEn | RW | | Receiver alignment enable.
2 | RxEn | RW | | Receiver enable control.
1 | RxPllPwron | RW | 1 | PLL power state control.
0 | HalfRate | RW | 0 | Digital half-rate data control.


13.14.167 Output Signals Override Register (Lane 6) 
Description 


Override of output signals 


Register 
R_PciePhyCrLane6OutOvrd 
Address


0xE98113030 


Bits | Mnemonic | Access | Reset | Definition
5 | Ovrd | RWS | 0 | Enable override of all bits in this register.
4 | TxRxpres | RW | 1 | Transmit receiver detection result.
3 | TxDone | RW | 0 | Transmit operation is complete output.
2 | Los | RW | 0 | Loss of signal output.
1 | RxPllState | RW | 0 | Current state of Rx PLL.
0 | RxValid | RW | | Receiver valid output.


13.14.168 Debug Control Register (Lane 6) 


Description 


Debug control register 


Register 
R_PciePhyCrLane6DbgCtl 


Address 
0xE98113038


Bits | Mnemonic | Access | Reset | Definition
14:10 | DtbSel1 | RW | | Select of wire to go onto DTB bit 1. 0 - disabled 1 -
9:5 | DtbSel0 | RW | | Select of wire to go onto DTB bit 0. 0 - disabled 1 -
4 | DisableRxCk | RW | 0 | Disable rx_ck output.
3 | InvertRx | RW | 0 | Invert receive data (pre-lbert).
2 | InvertTx | RW | 0 | Invert transmit data (post-lbert).
1 | ZeroRxData | RW | 0 | Override all receive data to zeros.
0 | ZeroTxData | RW | 0 | Override all transmit data to zeros.


13.14.169 Pattern Generator Controls Register (Lane 6) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrLane6PgCtl 


Address 
0xE98113080 


Bits | Mnemonic | Access | Reset | Definition
 | Pat | RW | 0x0 | Pattern for modes
3 | Trigger | RW | 0 | Insert a single error into a lsb
2:0 | Mode | RW | 0x0 | Pattern to generate. 0 - disabled 1 - lfsr15.


13.14.170 Pattern Matcher Controls Register (Lane 6) 
Description 


Pattern Matcher controls 


Register 
R_PciePhyCrLane6PmCtl 


Address 
0xE981130C0


Definition

Sync RW Synchronize pattern matcher LFSR with incoming data. Must be turned on then off t.

Mode RW Pattern to match. 0 - disabled 1 - lfsr15 2 - lfsr7 3 - d[n] = d[n-10] 4 - d[n] =


13.14.171 Pattern Match Error Counter Register (Lane 6) 


Description 

Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.
Register 

R_PciePhyCrLane6PmErr 


Address 
0xE981130C8 


Attributes 


-noregtest 


Bits | Mnemonic | Access | Reset | Definition
15 | Ov14 | RWS | X | If active, multiply COUNT by 128.
14:0 | Count | RWS | X | Current error count. If OV14 field is active, then multiply count by 128.


13.14.172 Current Phase Selector Value. Register (Lane 6) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrLane6Phase 


Address 
0xE981130D0 


Attributes 


-noregtest 


Bits | Mnemonic | Access | Reset | Definition
10:1 | Val | RWS | 0x0 | Current phase selector value.
0 | Dthr | RWS | 0 | Current phase selector value.

13.14.173 Current Frequency Integrator Value. Register (Lane 6)
Description 


Current frequency integrator value. 


Register 


R_PciePhyCrLane6Freq 


Address 


0xE981130D8 


Attributes 


-noregtest 


Bits | Mnemonic | Access | Reset | Definition
13:1 | Val | RWS | 0x0 | Current frequency integrator value.
0 | Dthr | RWS | 0 | Current frequency integrator value.


13.14.174 Scope Control Register (Lane 6) 
Description 


Control bits for per-transceiver scope portion 


Register 


R_PciePhyCrLane6ScopeCtl 


Address 


0xE981130E0 


Bits | Mnemonic | Access | Reset | Definition
14:11 | Base | RW | 0x0 | Which bit to sample when MODE = 1.
10:2 | Delay | RW | 0x0 | Number of symbols to skip between samples.
1:0 | Mode | RW | 0x0 | Mode of counters. 0 = off 1 = sample every 10 bits (see BASE) 2 = sample every 11.


13.14.175 Recovered Domain Receiver Control Register (Lane 6) 


Description 


Control bits for receiver in recovered domain 


Register 


R_PciePhyCrLane6RxCtl 


Address 


0xE981130E8 

Bits | Mnemonic | Access | Reset | Definition
 | SwitchVal | RW | 0 | Value to override the data/phase mux.
 | OvrdSwitch | | | Override the value of the data/phase mux.
12:10 | ModeBp | RW | | Set BP 2:0 to longer timescale (for FTS patterns). BP0 - Start PHUG profile at 4/.
9:8 | FrugValue | RW | 0x0 | Override value for FRUG.
7:6 | PhugValue | RW | 0x0 | Override value for PHUG.
 | OvrdDpllGain | RW | 0 | Override PHUG and FRUG values.
 | PhdetPol | RW | 0 | Reverse polarity of phase error.
3:2 | PhdetEdge | RW | | Edges to use for phase detection. Top bit is rising edges, bottom is falling.
 | PhdetEn | | | Enable phase detector. Top bit is odd slicers, bottom is even.


13.14.176 Receiver Debug Register (Lane 6) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrLane6RxDbg 


Address 
0xE981130F0 


DtbSel1 RW 0x0 Select wire to go on DTB bit 1.

DtbSel0 RW 0x0 Select wire to go on DTB bit 0.


13.14.177 RX Control Register (Lane 6) 


Description 


RX Control Bits 


Register 
R_PciePhyCrLane6RxAnaCtrl 


Address 
0xE98113180 


Attributes 


-noregtest 


Se a 

4 | RxLbiEn | RW | 0 | Digital serial (internal) loopback enable bit.

3 | | RW | 0 | Wafer level (external) loopback enable bit.

AOE [0 ek 95 enable bie 
a 
POF nate 



[of Aten RW] 


13.14.178 RX ATB Register (Lane 6) 


Description 


RX ATB bits 


Register 
R_PciePhyCrLane6RxAnaAtb 


Address 
0xE98113188 


Attributes 


-noregtest 

Bits | Mnemonic | Access | Reset | Definition
5 | SensemVrefLos | RW | 0 | Connect atb_sm to vref_los (vref_rx/14).
4 | SensemVcm | RW | 0 | Connect atb_sm to RX vcm.
3 | SensemRxM | RW | 0 | Connect atb_sm to rx_m.
2 | SensepRxP | RW | 0 | Connect atb_sp to rx_p.
1 | ForcepRxM | RW | 0 | Connect atb_fp to rx_m.
0 | ForcepRxP | RW | 0 | Connect atb_fp to rx_p.


13.14.179 8 Bit Programming Register (Lane 6) 
Description 


8 bit programming register 


Register 
R_PciePhyCrLane6PllPrg2 


Address 
0xE98113190 


Attributes 


-noregtest


[Denton ——s—C—~“*S*S*SC“‘“S*S*S*SCS*S 
AtbSenseSel RW Control of ——— charge pump current 1=Enable 
signals internal to the PLL. 
FrcHcpl RW Allow override of default value of hcpl. 1=allow hcpl_lcl to control high-couplin.

HcplLcl 1=force coupling in vco to maximum.

FrcPwron Allow override of default value of pll_pwron. 1=allow pwron_lcl to control pll po.

3 PwronLcl 1=power is supplied to the PLL.

Allow override of default value of pll_pwron. 1=allow pwron_lcl to control pll po.

RW 0 1=PLL is held in reset.

EnableTestPd 1=phase linearity of phase interpolator and VCO is being tested.


13.14.180 10 Bit Programming Register (Lane 6)
Description 


10 bit programming register 


Register 
R_PciePhyCrLane6PllPrg1


Address 
0xE98113198 


Attributes 


-noregtest 


Bits | Mnemonic | Access | Reset | Definition
 | Unused | RW | 1 | Unused.
 | SelRxck | RW | 0 | Use recovered clock as reference to the PLL.
7:5 | PropCntrl | RW | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full.
4:2 | IntCntrl | RW | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De.
1:0 | Unused | RW | | Unused.


13.14.181 10 Bit Programming Register (Lane 6) 


Description 


10 bit programming register 


Register 
R_PciePhyCrLane6PllMeas 


Address 
0xE981131A0 


Mnemonic

Measure copy of bias current in oscillator on atb_force_m.

MeasVcntrl Measure vcntrl on atb_sense_m If MEAS_VREF is set as well, atb_sense_p,m measu.

MeasVref Measure vref on atb_sense_p; gd on atb_sense_m If MEAS_VCNTRL is set as well, at.

Measure vp16 on atb_sense_p; gd on atb_sense_m.

Measure startup voltage on atb_sense_p; gd on atb_sense_m.

Measure vco supply voltage on atb_sense_p; gd on atb_sense_m.

MeasVpCp Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m If MEAS_1V is set as wel.

Measure 1V supply voltage on atb_sense_m If MEAS_VP_CP is set as well, atb_sense.

MeasCrowbar Measure crowbar bias voltage on atb_sense_p; gd on atb_sense_m.


13.14.182 TX ATB Control Register (Set 1) (Lane 6)
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane6TxAnaAtbsel1


Address 


0xE981131A8 


Attributes 


-noregtest 


VbpfSP RW Vbpf in edge rate control circuit on ATB_S_P. Set ATB_EN to make this useful.

-e_ | Benswr | 
S| tent? [RW [0 [Tr connected fo ATES? Por terme J 


Txp connected to ATB_S_P. Set ATB_EN to make this useful.


eee ae Soe ee 
PT [wretsP__ [RW fo id ewe OSOC—SOSSOOCCCC“‘CSC*S 
ro [vesP___[Rw foi Re SSCSC—SSCSCC 


13.14.183 TX ATB Control Register (Set 2) (Lane 6) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane6TxAnaAtbsel2 


Address 


0xE981131B0 


Attributes 
-noregtest 
Definition


AtbEn RW Connect internal and external ATB busses. Needed for all ATB measurements.
ies = We Renes <5". 0 Se et oP | 


o_ [Weiss RW] 


Vcm replica on ATB_S_P. Set ATB_EN to make this useful.

VbnsSM RW Vbns in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful.

VbpsSP RW Vbps in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful.

VbnfSM RW Vbnf in edge rate control circuit on ATB_S_M. Set ATB_EN to make this useful.


Enlpbk RW Enable TX external loopback. Make sure internal loopback is not ON.

0 RW 0 Enable TX internal loopback.


13.14.184 TX POWER STATE Control Register (Lane 6) 


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrLane6TxAnaControl 


Address 
0xE981131B8 


Attributes 


-noregtest


Definition

FrcPwrst RW Force power state. tx_en<1:0> input overridden by EN_LCL.

EnLcl Locally force tx_en<1:0>. 00 - power off 01 - tx idle (slow) 10 - transmit data 1.

FrcDo RW Force Dataovrd locally. When ON, overrides input data_ovrd value.

DataovrdLcl Local dataovrd control value. Set FRC_DO to make this useful.

FrcBeacon Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input an

1 BcnLcl RW Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful.

0 Unused RW 0 Unused reg


13.14.185 Transmit Control Inputs Status Register (Lane 7) 
Description 


Status of Transmit control inputs Reset value depends on inputs 


Register 
R_PciePhyCrLane7TxStat 
Address


0xE98113808

Bits | Mnemonic | Access | Reset | Definition
15 | Reserved | RS | X | Always reads as.
14:13 | TxEdgerate | RS | X | Edgerate control.
12:10 | TxAtten | RS | X | Attenuation amount control.
9:6 | TxBoost | RS | X | Boost amount control.
5 | Reserved | RS | X | Always reads as.
4 | TxClkAlign | RS | X | Command to align clocks.
3:1 | TxEn | RS | X | Transmit enable control.
0 | TxCkoEn | RS | X | Tx_cko clock enable.


13.14.186 Receiver Control Inputs Status Register (Lane 7) 


Description 


Status of Receiver control inputs Reset value depends on inputs 


Register 
R_PciePhyCrLane7RxStat 


Address 
0xE98113810 


Bits | Mnemonic | Access | Reset | Definition
14 | Reserved | RS | X | Always reads as.
13:12 | Los | RS | X | LOS filtering mode control.
11 | DpllReset | RS | X | DPLL reset control.
10:8 | RxDpllMode | RS | X | DPLL mode control.
7:5 | RxEqVal | RS | X | Equalization amount control.
4 | RxTermEn | RS | X | Receiver termination enable.
3 | RxAlignEn | RS | X | Receiver alignment enable.
2 | RxEn | RS | X | Receiver enable control.
1 | RxPllPwron | RS | X | PLL power state control.
0 | HalfRate | RS | X | Digital half-rate data control.


13.14.187 Output Signals Status Register (Lane 7) 


Description 


Status of output signals Reset value depends on inputs 


Register 
R_PciePhyCrLane7OutStat 


Address 
0xE98113818 


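Bits | Mnemonic | Access | Reset | Definition
5 | Reserved | RS | X | Always reads as.
4 | TxRxpres | RS | X | Transmit receiver detection result.
3 | TxDone | RS | X | Transmit operation is complete output.
2 | Los | RS | X | Loss of signal output.
1 | RxPllState | RS | X | Current state of Rx PLL.
0 | RxValid | RS | X | Receiver valid output.

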
13.14.188 Transmitter Control Inputs Override Register (Lane 7)
Description 


Override of Transmitter control inputs 


Register 
R_PciePhyCrLane7TxOvrd 


Address 


0xE98113820 

Bits | Mnemonic | Access | Reset | Definition
15 | Ovrd | RWS | 0 | Enable override of all bits in this register.
14:13 | TxEdgerate | RW | 0x0 | Edgerate control.
12:10 | TxAtten | RW | 0x0 | Attenuation amount control.
9:6 | TxBoost | RW | 0x0 | Boost amount control.
5 | Reserved | RW | 0 | No effect.
4 | TxClkAlign | RW | 0 | Command to align clocks.
3:1 | TxEn | RW | 0x3 | Transmit enable control.
0 | TxCkoEn | RW | 1 | Tx_cko clock enable.


13.14.189 Receiver Control Inputs Override Register (Lane 7) 


Description 


Override of Receiver control inputs 


Register 
R_PciePhyCrLane7RxOvrd 


Address 
0xE98113828 


Bits | Mnemonic | Access | Reset | Definition
14 | Ovrd | RWS | 0 | Enable override of all bits in this register.
13:12 | Los | RW | | LOS filtering mode control.
11 | DpllReset | RW | 0 | DPLL reset control.
10:8 | RxDpllMode | RW | | DPLL mode control.
7:5 | RxEqVal | RW | 0x0 | Equalization amount control.
4 | RxTermEn | RW | 1 | Receiver termination enable.
3 | RxAlignEn | RW | | Receiver alignment enable.
2 | RxEn | RW | | Receiver enable control.
1 | RxPllPwron | RW | 1 | PLL power state control.
0 | HalfRate | RW | 0 | Digital half-rate data control.


13.14.190 Output Signals Override Register (Lane 7) 
Description 


Override of output signals 


Register 
R_PciePhyCrLane7OutOvrd 
Address


0xE98113830 


5 | Ovrd | RWS | 0 | Enable override of all bits in this register.
4 | TxRxpres | RW | 1 | Transmit receiver detection result.
3 | TxDone | RW | 0 | Transmit operation is complete output.
2 | Los | RW | 0 | Loss of signal output.
1 | RxPllState | RW | 0 | Current state of Rx PLL.
0 | RxValid | RW | 1 | Receiver valid output.


13.14.191 Debug Control Register (Lane 7) 


Description 


Debug control register 


Register 
R_PciePhyCrLane7DbgCtl 


Address 
0xE98113838


14:10 | DtbSel1 | RW | | Select of wire to go onto DTB bit 1 0 - disabled 1 -
9:5 | DtbSel0 | RW | | Select of wire to go onto DTB bit 0 0 - disabled 1 -
4 | DisableRxCk | RW | 0 | Disable rx_ck output.
3 | InvertRx | RW | 0 | Invert receive data (pre-lbert).
2 | InvertTx | RW | 0 | Invert transmit data (post-lbert).
1 | ZeroRxData | RW | 0 | Override all receive data to zeros.
0 | ZeroTxData | RW | 0 | Override all transmit data to zeros.


13.14.192 Pattern Generator Controls Register (Lane 7) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrLane7PgCtl 


Address 
0xE98113880 


aa [Pa [RW [00 | | PattmformodsssSSOSC~—~SCSC~*d 
Ps [Tigger [RW [0 |__| Insert a single error mtoaSb—SSSOS—S—SCS 


2:0 | Mode | RW | 0x0 | Pattern to generate 0 - disabled 1 - lfsr15.


13.14.193 Pattern Matcher Controls Register (Lane 7) 
Description 


Pattern Matcher controls 
Register
R_PciePhyCrLane7PmCtl 


Address 
0xE981138C0


Sync | RW | Synchronize pattern matcher LFSR with incoming data; must be turned on then off t.
Mode | RW | Pattern to match 0 - disabled 1 - lfsr15 2 - lfsr7 3 - d[n] = d[n-10] 4 - d[n] =


13.14.194 Pattern Match Error Counter Register (Lane 7) 


Description 

Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.
Register 

R_PciePhyCrLane7PmErr 


Address 
0xE981138C8 


Attributes 


-noregtest 


15 | Ov14 | RWS | X | If active, multiply COUNT by 128.
14:0 | Count | RWS | X | Current error count. If OV14 field is active, then multiply count by 128.
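
The scaling rule above can be illustrated with a short sketch. It assumes the overflow flag (OV14) sits in bit 15, directly above the 15-bit Count field in bits 14:0; the function name is hypothetical and not part of any SiCortex software.

```python
COUNT_MASK = 0x7FFF   # Count field, bits 14:0
OV14_SHIFT = 15       # assumed position of the OV14 flag


def pm_error_count(raw: int) -> int:
    """Return the effective error count encoded in a raw register value."""
    count = raw & COUNT_MASK
    ov14 = (raw >> OV14_SHIFT) & 1
    # When OV14 is active the hardware count is in units of 128 errors.
    return count * 128 if ov14 else count
```

Because a read resets the register, software would typically read once and accumulate the decoded value rather than re-reading.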


13.14.195 Current Phase Selector Value. Register (Lane 7) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrLane7Phase 


Address 
0xE981138D0 


Attributes 


-noregtest 


ior [val RWS [0x0 | «(| Cann phase sdlector value ——SSSSCSCSCSCSCS 
po [Dime RWS [0 Carrent phase selectorvalue 
13.14.196 Current Frequency Integrator Value. Register (Lane 7)
Description 


Current frequency integrator value. 


Register 


R_PciePhyCrLane7Freq 


Address 


0xE981138D8 


Attributes 


-noregtest 


Pisa [val RWS [0x0 |__| Current frequency integrator value 


Po [Dine RWS [0 [| Curent frequency integrator value 


13.14.197 Scope Control Register (Lane 7) 
Description 


Control bits for per-transceiver scope portion 


Register 


R_PciePhyCrLane7ScopeCtl 


Address 


0xE981138E0


14:11 | Base | RW | 0x0 | Which bit to sample when MODE = 1.
10:2 | Delay | RW | 0x0 | Number of symbols to skip between samples.
1:0 | Mode | RW | 0x0 | Mode of counters 0 = off 1 = sample every 10 bits (see BASE) 2 = sample every 11.
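
As an illustration of the field layout above, the following sketch packs and unpacks a Scope Control value using the documented bit positions (Base 14:11, Delay 10:2, Mode 1:0). The helper names are hypothetical; only the bit positions come from the table.

```python
def pack_scope_ctl(base: int, delay: int, mode: int) -> int:
    """Build a Scope Control register value from its three fields."""
    if not 0 <= base <= 0xF:
        raise ValueError("Base is a 4-bit field")
    if not 0 <= delay <= 0x1FF:
        raise ValueError("Delay is a 9-bit field")
    if not 0 <= mode <= 0x3:
        raise ValueError("Mode is a 2-bit field")
    return (base << 11) | (delay << 2) | mode


def unpack_scope_ctl(value: int) -> tuple[int, int, int]:
    """Split a Scope Control register value into (base, delay, mode)."""
    return (value >> 11) & 0xF, (value >> 2) & 0x1FF, value & 0x3
```

For example, sampling bit 3 every 10 bits while skipping 5 symbols between samples corresponds to `pack_scope_ctl(3, 5, 1)`.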


13.14.198 Recovered Domain Receiver Control Register (Lane 7) 


Description 


Control bits for receiver in recovered domain 


Register 


R_PciePhyCrLane7RxCtl 


Address 


0xE981138E8
SwitchVal | RW | 0 | Value to override the data/phase mux.
OvrdSwitch Override the value of the data/phase mux. 
ModeBp RW 


12:10 [iia Set BP 2:0 to longer timescale (for FTS patterns) BPO - 


Start PHUG profile at 4/. 


9:8 | FrugValue | RW | 0x0 | Override value for FRUG.
7:6 | PhugValue | RW | 0x0 | Override value for PHUG.
OvrdDpllGain | RW | 0 | Override PHUG and FRUG values.
PhdetPol | RW | 0 | Reverse polarity of phase error.


3:2 PhdetEdge RW Edges to use for phase detection top bit is rising edges, 
bottom is falling. 


PhdetEn fee ee | Enable phase detector top bit is odd slicers, bottom is 
even. 


13.14.199 Receiver Debug Register (Lane 7) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrLane7RxDbg 


Address 
0xE981138F0 


DtbSel1 | RW | 0x0 | Select wire to go on DTB bit 1.
DtbSel0 | RW | 0x0 | Select wire to go on DTB bit 0.


13.14.200 RX Control Register (Lane 7) 


Description 


RX Control Bits 


Register 
R_PciePhyCrLane7RxAnaCtrl 


Address 
0xE98113980 


Attributes 


-noregtest 


Se a 

}4 | RxlbiEfn |RW [0 ~~ [|__| Digital serial (internal) loopback enable bit. 

[3 [Reba | RW—| 0 || Wafer level (external) loopback enable bit. 

AOE [0 ek 95 enable bie 
Po | [Margin enable bit 
POF nate 
13.14.201 RX ATB Register (Lane 7)
Description 


RX ATB bits 


Register 
R_PciePhyCrLane7RxAnaAtb 


Address 
0xE98113988 


Attributes 


-noregtest 
5 | SensemVrefLos | RW | 0 | Connect atb_s_m to vref_los (vref_rx/14).
4 | SensemVcm | RW | 0 | Connect atb_s_m to RX vcm.
3 | SensemRxM | RW | 0 | Connect atb_s_m to rx_m.
2 | SensepRxP | RW | 0 | Connect atb_s_p to rx_p.
1 | ForcepRxM | RW | 0 | Connect atb_f_p to rx_m.
0 | ForcepRxP | RW | 0 | Connect atb_f_p to rx_p.


13.14.202 8 Bit Programming Register (Lane 7) 
Description 


8 bit programming register 


Register 
R_PciePhyCrLane7PllPrg2 


Address 
0xE98113990 


Attributes 


-noregtest


[Denton ——s—C—~“*S*S*SC“‘“S*S*S*SCS*S 
AtbSenseSel RW Control of ——— charge pump current 1=Enable 
signals internal to the PLL. 
FrceHcpl RW Allow override of default value of hep] 1=allow hcpl_lcl to 
ae high-couplin. 


aa HeplLcl ee | 1=force coupling in vco to maximum. ssid 


FrcPwron |_| ie override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


| 3 | Pwronkcl | 10 =| ‘| 1=power is supplied to the PLL. 


Se ocr Allow override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


[Reeth RW [0 [___[ T= PEL is held placed in reset 


P| EnableTestPd pee 1=phase linearity of phase interpolator and VCO is being 
tested. 
13.14.203 10 Bit Programming Register (Lane 7)
Description 


10 bit programming register 


Register 
R_PciePhyCrLane7PllPrg1 


Address 
0xE98113998 


Attributes 


-noregtest 


Po [Unset [RW [1 |__| Unusoa 


8 | SelRxck | RW | 0 | Use recovered clock as reference to the PLL.
7:5 | PropCntrl | RW | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full.
4:2 | IntCntrl | RW | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De.
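
The (n+1)/8 scaling used by both charge pump fields can be worked through with a small sketch. It assumes each field is a 3-bit code n and returns the resulting fraction of the full-scale current; the function name is hypothetical.

```python
from fractions import Fraction


def charge_pump_fraction(n: int) -> Fraction:
    """Fraction of full-scale current selected by a 3-bit control code n."""
    if not 0 <= n <= 7:
        raise ValueError("control code must fit in 3 bits")
    return Fraction(n + 1, 8)
```

With the reset values listed above, PropCntrl = 0x5 selects 6/8 (three quarters) of full scale and IntCntrl = 0x2 selects 3/8.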


Po [Unsed RW os sed SSOSOSCSCSOSOSCSCSCSCSC*S 


13.14.204 10 Bit Programming Register (Lane 7) 


Description 


10 bit programming register 


Register 
R_PciePhyCrLane7PllMeas 


Address 
0xE981139A0 


Mnemonié 
Measure copy of bias current in oscillator on atb_force_m. 


MeasVcntrl Measure ventrl on atb_sense_m If MEAS_VREF is set as 




well, atb_sense_p,m mea- su. 

MeasVref Measure vref on atb_sense_p; gd on atb_sense_m If 
MEAS_VCNTRL is set as well, at. 

Measure vp16 on atb_sense_p; gd on atb_sense_m. 
MESSIAH Measure startup voltage on atb_sense_p; gd on 
atb_sense_m. 




Measure vco supply voltage on atb_sense_p; gd on 

atb_sense_m. 

MeasVpCp Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m 

If MEAS_1V is set as wel. 

Measure 1V_ supply voltage on atb_sense_m If 

MEAS_VP_CP is set as well, atb_sense. 

MeasCrowbar Measure crowbar bias voltage on atb_sense_p; gd on 
atb_sense_m. 


13.14.205 TX ATB Control Register (Set 1) (Lane 7)
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane7TxAnaAtbsel1


Address 


0xE981139A8 


Attributes 


-noregtest 


VbpfSP RW Vbpf in edge rate control circuit on ATB_S_P Set 
cll lt ATBLEN to take this well 

-e_ | Benswr | 
S| tent? [RW [0 [Tr connected fo ATES? Por terme J 


al a Txp connected to ATB. S_P Set ATB- EN to make this 
useful. 


eee ae Soe ee 
PT [wretsP__ [RW fo id ewe OSOC—SOSSOOCCCC“‘CSC*S 
ro [vesP___[Rw foi Re SSCSC—SSCSCC 


13.14.206 TX ATB Control Register (Set 2) (Lane 7) 
Description 


TX ATB Control Bits 


Register 


R_PciePhyCrLane7TxAnaAtbsel2 


Address 


0xE981139B0 


Attributes 
-noregtest 


AtbEn RW Connect internal and external ee busses Needed for all 
ATB measurements. 
ies = We Renes <5". 0 Se et oP | 


o_ [Weiss RW] 


Peas PR Vi ER ATEN TT Vem replica on ATB_S_P Set ATB_EN to make this use- 
4 


VbnsSM RW a aie in edge rate control circuit on ATB_S_M Set 


ATB_EN to make this useful. 

VbpssP RW Vbps in edge rate control circuit on ATB_S_M Set 
oad ATB_EN to make this useful. 

VbnfiSM RW Vbnf in edge rate control circuit on ATB_S_M Set 
ATB_EN to make this useful. 


Enlpbk Rian ee Enable TX external loopback Make sure internal loopback 
is not ON 


[OP eatpbk [RW [0 [| Bable TX taternal Toopback—— 


13.14.207 TX POWER STATE Control Register (Lane 7) 


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrLane7TxAnaControl 


Address 
0xE981139B8 


Attributes 


-noregtest


FrcPwrst | RW | Locally force power state. tx_en<1:0> input overridden by EN_LCL.
EnLcl | RW | Locally force tx_en<1:0> 00 - power off 01 - tx idle (slow)
FrcDo | RW | Force Dataovrd locally. When ON, overrides input data_ovrd value.
3 | DataovrdLcl | RW | Local dataovrd control value. Set FRC_DO to make this useful.
2 | FrcBeacon | RW | Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input value.
1 | BcnLcl | RW | Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful.


PO [Unused RW Osi Umusedreg SSCS 


13.14.208 PHY Reset Register 
Description 


Write a 1 to reset the Phy. Write-only (not a real register).


Register 
R_PciePhyCrReset 
Address


0xE9813F9F8


0 | Reset | WS | 0 | Write a 1 to reset the Phy. Write-only (not a real register).


13.14.209 Transmit Control Inputs Status Register (Broadcast) 
Description 


Status of Transmit control inputs. Reset value depends on inputs.


Register 


R_PciePhyCrBcastTxStat 


Address 


0xE98151808 


Attributes 


-noregtest 


Roose 
Pis__[ Reserved [WS [X |_| Alwaysradsasd—SOSOSC~—SCSCSCS 
14:13 | TxEdgerate | WS | X | Edgerate control.
12:10 | TxAtten | WS | X | Attenuation amount control.
9:6 | TxBoost | WS | X | Boost amount control.


[S| Reserved _[WS[X |__| Alwaysreadsas@.—SSOSCS—~—SSCSCSCCSY 
4 | TxClkAlign | WS | X | Command to align clocks.
3:1 | TxEn | WS | X | Transmit enable control.
0 | TxCkoEn | WS | X | Tx_cko clock enable.


13.14.210 Receiver Control Inputs Status Register (Broadcast) 
Description 


Status of Receiver control inputs. Reset value depends on inputs.


Register 


R_PciePhyCrBcastRxStat 


Address 


0xE98151810 


Attributes 


-noregtest 
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pid [Reoved [WS [X |_| AlvaysradsasSSSSCSC~—CSCSCSCS~S~S 
isTa [Tos [WS [X___ |__| £08 filtermg mode control SSCS 
Pi [DpliReset__[ WS__[X__ |__| DPE reset controk__——SSSCSCSSY 
"108 [RxDpiiMods |WS_[X___ |__| DPE. mode controh ———SS—S—SCSCS—SCC—*Y 

PRxEqVal__[WS_[X___ |__| Baualization amount contro 

ReTermEn Px [| Receiver termination enable. SSS 
[ [Receiver alignment enable. 


RxPllPwron 


|X | ‘| PLL power state control. 
|X |__| Digital half-rate data control. 


pa [rin WS [X__[ [Receiver enable control SSCS 
X 
ae 


13.14.211 Output Signals Status Register (Broadcast) 
Description 


Status of output signals. Reset value depends on inputs.


Register 


R_PciePhyCrBcastOutStat 


Address 
0xE98151818


Attributes 


-noregtest 


Reset 
[S_[Reaved [WS [X |__| Always readsas T 

4 | TxRxpres | WS | X | Transmit receiver detection result.
3 | TxDone | WS | X | Transmit operation is complete output.
2 | Los | WS | X | Loss of signal output.
1 | RxPllState | WS | X | Current state of Rx PLL.
0 | RxValid | WS | X | Receiver valid output.


13.14.212 Transmitter Control Inputs Override Register (Broadcast) 
Description 


Override of Transmitter control inputs 


Register 
R_PciePhyCrBcastTxOvrd 


Address 
0xE98151820 


Attributes 


-noregtest 
15 | Ovrd | WS | 0 | Enable override of all bits in this register.
14:13 | TxEdgerate | WS | 0x0 | Edgerate control.
12:10 | TxAtten | WS | 0x0 | Attenuation amount control.
9:6 | TxBoost | WS | 0x0 | Boost amount control.
5 | Reserved | WS | 0 | No effect.
4 | TxClkAlign | WS | 0 | Command to align clocks.
3:1 | TxEn | WS | | Transmit enable control.
0 | TxCkoEn | WS | 1 | Tx_cko clock enable.


13.14.213 Receiver Control Inputs Override Register (Broadcast) 
Description 


Override of Receiver control inputs 


Register 


R_PciePhyCrBcastRxOvrd 


Address 


0xE98151828 


Attributes 


-noregtest 

14 | Ovrd | WS | 0 | Enable override of all bits in this register.
Taz [Lost | WS Oe —[ [608 filtering mode contr 

DpliResect | WS [0 = | | DPLLI reset control. 


RxDplMode | WS | 0x4 |_| DPLI. mode control, 
RxBiqVal_ [WS [| 0x0__| | Banallization amount controk_————SSSSSS— 


Pf [ReTormEn [WS [1 |__| Receiver termination enable. Ss 


Po [raliRate | Ris = 


[ [Receiver alignment enable. SSS 
- [Receiver enable contro SSCS 
[_ [PLL power state control SCS 
[Digital halFrate data control SS 


13.14.214 Output Signals Override Register (Broadcast) 
Description 


Override of output signals 


Register 


R_PciePhyCrBcastOutOvrd 


Address 


0xE98151830 
Attributes


-noregtest 


5 | Ovrd | WS | 0 | Enable override of all bits in this register.
4 | TxRxpres | WS | 1 | Transmit receiver detection result.
3 | TxDone | WS | 0 | Transmit operation is complete output.
2 | Los | WS | 0 | Loss of signal output.
1 | RxPllState | WS | 0 | Current state of Rx PLL.
0 | RxValid | WS | | Receiver valid output.


13.14.215 Debug Control Register (Broadcast) 
Description 


Debug control register 


Register 
R_PciePhyCrBcastDbgCtl 


Address 
0xE98151838 


Attributes 


-noregtest


14:10 | DtbSel1 | WS | | Select of wire to go onto DTB bit 1 0 - disabled 1 -
9:5 | DtbSel0 | WS | | Select of wire to go onto DTB bit 0 0 - disabled 1 -
4 | DisableRxCk | WS | 0 | Disable rx_ck output.
3 | InvertRx | WS | 0 | Invert receive data (pre-lbert).
2 | InvertTx | WS | 0 | Invert transmit data (post-lbert).
1 | ZeroRxData | WS | 0 | Override all receive data to zeros.
0 | ZeroTxData | WS | 0 | Override all transmit data to zeros.


13.14.216 Pattern Generator Controls Register (Broadcast) 


Description 


Pattern Generator controls 


Register 
R_PciePhyCrBcastPgCtl 


Address 
0xE98151880 


Attributes 


-noregtest 
risa [ Pad [WS [0 | | Pationformodas ss.
3 [Trigg [WS [0 |__| Tisert a single enor mows OSS 


2:0 | Mode | WS | 0x0 | Pattern to generate 0 - disabled 1 - lfsr15.


13.14.217 Pattern Matcher Controls Register (Broadcast) 


Description 


Pattern Matcher controls 


Register 
R_PciePhyCrBcastPmCtl 


Address 
0xE981518C0 


Attributes 


-noregtest


Sync | WS | Synchronize pattern matcher LFSR with incoming data; must be turned on then off t.
Mode | WS | Pattern to match 0 - disabled 1 - lfsr15 2 - lfsr7 3 - d[n] = d[n-10] 4 - d[n] =


13.14.218 Pattern Match Error Counter Register (Broadcast) 


Description 


Pattern match error counter. A read resets the register. When the clock to the error counter is off, reads and
writes to the register are queued until the clock is turned back on.


Register 

R_PciePhyCrBcastPmErr 
Address 

0xE981518C8 
Attributes 

-noregtest 


15 | Ov14 | WS | | If active, multiply COUNT by 128.
14:0 | Count | WS | | Current error count. If OV14 field is active, then multiply count by 128.


13.14.219 Current Phase Selector Value. Register (Broadcast) 


Description 


Current phase selector value. 


Register 
R_PciePhyCrBcastPhase 
Address
0xE981518D0


Attributes 


-noregtest 


por [val WS [x0 | (| Canent phase solectorvalne——SOSSOSCSCSCSCS~S 
po [Dim ws fo Canrent phase selector value 


13.14.220 Current Frequency Integrator Value. Register (Broadcast) 
Description 


Current frequency integrator value. 


Register 
R_PciePhyCrBcastFreq 


Address 
0xE981518D8 


Attributes 


-noregtest 


Pisa [val [WS [0x0 |__| Current frequency integrator value 


[odie [WS [0 |__| Curent frequeney integrator value 


13.14.221 Scope Control Register (Broadcast) 


Description 


Control bits for per-transceiver scope portion 


Register 
R_PciePhyCrBcastScopeCtl 


Address 
0xE981518E0 


Attributes 


-noregtest 


14:11 | Base | WS | 0x0 | Which bit to sample when MODE = 1.
10:2 | Delay | WS | 0x0 | Number of symbols to skip between samples.
1:0 | Mode | WS | 0x0 | Mode of counters 0 = off 1 = sample every 10 bits (see BASE) 2 = sample every 11.


13.14.222 Recovered Domain Receiver Control Register (Broadcast) 


Description 


Control bits for receiver in recovered domain 
Register
R_PciePhyCrBcastRxCtl 


Address 
0xE981518E8


Attributes 


-noregtest 


14 | SwitchVal | WS | | Value to override the data/phase mux.
13 | OvrdSwitch | WS | | Override the value of the data/phase mux.
12:10 | ModeBp | WS | 0x0 | Set BP 2:0 to longer timescale (for FTS patterns) BP0 -
9:8 | FrugValue | WS | 0x0 | Override value for FRUG.
7:6 | PhugValue | WS | 0x0 | Override value for PHUG.
5 | OvrdDpllGain | WS | | Override PHUG and FRUG values.
4 | PhdetPol | WS | 0 | Reverse polarity of phase error.
3:2 | PhdetEdge | WS | 0 | Edges to use for phase detection: top bit is rising edges, bottom is falling.
1:0 | PhdetEn | WS | | Enable phase detector: top bit is odd slicers, bottom is even.


13.14.223 Receiver Debug Register (Broadcast) 


Description 


Control bits for receiver debug 


Register 
R_PciePhyCrBcastRxDbg 


Address 
0xE981518F0


Attributes 


-noregtest 


DtbSel1 | WS | 0x0 | Select wire to go on DTB bit 1.
DtbSel0 | WS | 0x0 | Select wire to go on DTB bit 0.


13.14.224 RX Control Register (Broadcast) 


Description 


RX Control Bits 


Register 
R_PciePhyCrBcastRxAnaCtrl 


Address 
0xE98151980 
Attributes


-noregtest 


Ps —[Unsed [WS [1 |__| Unused 
14 | Rxlbifn | WS [0 [|__| Digital serial (internal) loopback enable bit. 
[3 RelbeFin [WS [0[____| Water level (external) loopback enable bit. 


pa [Rekeasen [WS [0 |__| Rek025 enable bit SSSOSOSCS—S 
PT [Merginn [WS [0 |__| Margin enable bit. SSCS 
Po _[Atbin [WS _[o_[__[ ATBenablebit. SSC 


13.14.225 RX ATB Register (Broadcast) 
Description 


RX ATB bits 


Register 


R_PciePhyCrBcastRxAnaAtb 


Address 
0xE98151988 


Attributes 


-noregtest 
5 | SensemVrefLos | WS | 0 | Connect atb_s_m to vref_los (vref_rx/14).
4 | SensemVcm | WS | 0 | Connect atb_s_m to RX vcm.
3 | SensemRxM | WS | 0 | Connect atb_s_m to rx_m.
2 | SensepRxP | WS | 0 | Connect atb_s_p to rx_p.
1 | ForcepRxM | WS | 0 | Connect atb_f_p to rx_m.
0 | ForcepRxP | WS | 0 | Connect atb_f_p to rx_p.


13.14.226 8 Bit Programming Register (Broadcast) 
Description 


8 bit programming register 


Register 


R_PciePhyCrBcastPllPrg2 


Address 


0xE98151990 


Attributes 


-noregtest 
AtbSenseSel WS —— of Proportional charge pump current 1=Enable 
signals internal to the PLL. 
FrceHcpl WS Allow override of default value of hcp] 1=allow hcpl_lcl to 
control high-couplin. 


Ea HeplLcl |WwS [0 | __ |{ 1=force coupling in vco to maximum. 


FrcPwron WS Allow override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


| 3 | PwronLcl | | |: 1=power is | 1=power is supplied tothe PLL. to the PLL. 


Ls ero Le a Bc override of default value of pll_pwron 1=allow 
pwron_lcl to control pll po. 


| 1 | ResetLcl =[WS = [|0 | | 1=PLL is held/placed in reset. 


pe EnableTestPd be ae 1=phase linearity of phase interpolator and VCO is being 
tested. 


13.14.227 10 Bit Programming Register (Broadcast) 


Description 


10 bit programming register 


Register 


R_PciePhyCrBcastPllPrg1 


Address 


0xE98151998 


Attributes 


-noregtest 


fo [Unecdt [WS [1 |  [Umed——S=~—“‘CS;7XCS; ; SCS” 
8 | SelRxck | WS | 0 | Use recovered clock as reference to the PLL.
7:5 | PropCntrl | WS | 0x5 | Control of Proportional charge pump current. Proportional current = (n+1)/8*full.
4:2 | IntCntrl | WS | 0x2 | Control of Integral charge pump current. Integral current = (n+1)/8*full_scale De.


PEO [Unused PWS [ot |] Tse 


13.14.228 10 Bit Programming Register (Broadcast) 
Description 


10 bit programming register 


Register 


R_PciePhyCrBcastPllMeas 


Address 
0xE981519A0 
Attributes


-noregtest 


Measure vref on atb_sense_p; gd on atb_sense_m If 
MEAS_VCNTRL is set as well, at. 


startup voltage on atb_sense_p; gd on 
atb_sense_m. 

Measure vco supply voltage on atb_sense_p; gd on 
atb_sense_m. 

Measure vp_cp voltage on atb_sense_p; gd on atb_sense_m 
If MEAS_1V is set as wel. 

Measure 1V_ supply voltage on atb_sense_m If 
MEAS_VP_CP is set as well, atb_sense. 

Measure crowbar bias voltage on atb_sense_p; gd on 
atb_sense_m. 


Unused WS Unused. 
|O [Unused | WS |Unused§ —“‘“‘CSC*~*C 


13.14.229 TX ATB Control Register (Set 1) (Broadcast) 
Description 


TX ATB Control Bits 


Register 
R_PciePhyCrBcastTxAnaAtbsel1


Address 
0xE981519A8 


Attributes 


-noregtest 


7 VbpfSP WS Vbpf in edge rate control circuit on ATB_S_P Set 
ie ae Pee 

(6 [Bens WS [0 |__| Tian on ATB-S-M Set ATB-EN to make this wsohil | 
Ps | tmFP [WS [0 |] Txm connected to ATELS-P Fortemm. sd 


ee ee ee Txp connected to ATB_S_P Set ATB_EN to make this 
useful. 


PBeprP_ | WS__[0 |_| Txp connected to ATB-F-P For term: 


VrelSP a 
VersP PW Reg 


13.14.230 TX ATB Control Register (Set 2) (Broadcast) 




Description 


TX ATB Control Bits 
Register
R_PciePhyCrBcastTxAnaAtbsel2 


Address 
0xE981519B0 


Attributes 


-noregtest


[Deinition——SSCSCSCSCSCSCSCSCSCSC*d 


AtbEn WS Connect internal and external a busses Needed for all 
ATB measurements. 


Po [reid WSR 


Eee ia) La Vem replica on ATB_S_P Set ATB_EN to make this use- 
4 


VbnsSM ie in edge rate control circuit on ATB_S_M Set 
ATB_EN to make this useful. 


ATB_EN to make this useful. 
VbnfSM ae Vbnf in edge rate control circuit on ATB_S_M Set 


VbpsSP Lo Vbps in edge rate control circuit on ATB_S_M Set 


ATB_EN to make this useful. 


1 Enlpbk Enable TX external loopback Make sure internal loopback 
is not ON. 


[0 [matsiipbk [WS [0 |__| Bhable TX maternal Toopback SSCS 


13.14.231 TX POWER STATE Control Register (Broadcast) 


Description 


TX POWER STATE Control Bits 


Register 
R_PciePhyCrBcastTxAnaControl 


Address 
0xE981519B8 


Attributes 


-noregtest


FrcPwrst | WS | Locally force power state. tx_en<1:0> input overridden by EN_LCL.
EnLcl | WS | 0x0 | Locally force tx_en<1:0> 00 - power off 01 - tx idle (slow) 10 - transmit data 1.
FrcDo | WS | Force Dataovrd locally. When ON, overrides input data_ovrd value.
3 | DataovrdLcl | WS | Local dataovrd control value. Set FRC_DO to make this useful.
2 | FrcBeacon | WS | Force Beacon to local value (BCN_LCL). When On, BCN_LVL overrides input value.
1 | BcnLcl | WS | Local Beacon On/Off Control Value. Set FRC_BEACON to make this useful.


PO [Unased WSO imasedieg OSSSOSCSSCSCS 
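
The tx_en<1:0> encodings listed for EnLcl can be captured in a small lookup. This is only a sketch: the description of the fourth encoding (binary 11) is cut off in the table, so it is deliberately left unmapped here, and the names below are hypothetical.

```python
# Encodings taken from the EnLcl description; 0b11 is truncated in the
# source text, so it is intentionally absent from this table.
TX_EN_STATES = {
    0b00: "power off",
    0b01: "tx idle (slow)",
    0b10: "transmit data",
}


def decode_tx_en(value: int):
    """Return the documented power state for a 2-bit tx_en value, or None."""
    if not 0 <= value <= 3:
        raise ValueError("tx_en is a 2-bit field")
    return TX_EN_STATES.get(value)
```

Per the FrcPwrst description, a locally forced value of this kind only takes effect once the corresponding force bit is set.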
13.15 Transaction, Link, MAC Layers


Please refer to the Synopsys “PCI-Express Controller Core Data Book”, which describes the pins, the timing
requirements, and the programmer-visible registers. We have configured the PCI-Express RC core to fit our use.
This section documents our configuration choices.


Parameters for design DWC_pcie_rc 


General Configuration


Symbols per Cycle / Operating Frequency
Parameter Name : CX_NB. Specifies the operating frequency of the core. This is also referred to as the number of
symbols that are handled each core_clk cycle, or the S-ness of the core. 1S => 250MHz : 2S => 125MHz.

Maximum Number of Lanes Supported
Parameter Name : CX_NL. Specifies the maximum number of lanes that are supported by the core.

Datapath Width
Parameter Name : CX_NW. Specifies the width of the datapath (number of dwords per cycle).

Number of Virtual Channels
Parameter Name : CX_NVC. Specifies the number of Virtual Channels supported by the core. A maximum of eight
VCs are supported.

Enable ECRC Support
Parameter Name : CX_ECRC_ENABLE. Removes support for ECRC. May be disabled for smaller gate size if the
Core is placed in a system where it is guaranteed that received TLPs don't contain ECRC AND the Application does
not transmit ECRC from the Client interfaces. This option is only available when Include Target Interface 1 is
selected.

RAM data error protection config
Parameter Name : CX_RAM_PROTECTION_MODE. RAM data error protection mode. Parity: Selects parity to
check RAM data error. ECC: Selects ECC to check and correct RAM data error. None: Disables both parity and
ECC modes.

Parameter Name : CX_PAR_MODE. Config RAM data width per parity bit.

RAM ECC pipeline enable
Parameter Name : CX_ECC_PIPE_EN. Enable RAM ECC pipeline.

Remove Port Logic Registers
Parameter Name : CX_PL_REG_DISABLE. Removes Port Logic registers.

Use RocketIO PHY
Parameter Name : RIO_POPULATED. FPGA design using Xilinx RocketIO PHY.

DBI ReadOnly Write Enable (value 0x1)
Parameter Name : CX_DBI_RO_WR_EN. Enable ReadOnly/HwInit registers to be writable through DBI.

Parameter Name : FPGA. This parameter specifies FPGA application.

Include Target 1 Interface
Parameter Name : TRGT1_POPULATE. Specifies the inclusion or omission of the Target interface 1.

Application Error Reporting
Parameter Name : APP_RETURN_ERR_EN. Determines whether to include input ports for application-detected
error reporting.

Mask Completion Timeout Errors
Parameter Name : CPL_TIMEOUT_ERR_MASK. When defined, no error will be reported to the CDM; the
application is responsible for returning cpl timeout error through the application error return interface
(APP_RETURN_ERR_EN).
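
The relationship between CX_NB and the core clock can be sketched as follows. The sketch assumes the two settings quoted for CX_NB mean one symbol per core_clk cycle at 250 MHz and two symbols per cycle at 125 MHz, so that the per-lane symbol rate stays constant; the function name is hypothetical.

```python
SYMBOL_RATE_MHZ = 250  # symbols per microsecond per lane, per the CX_NB note


def core_clk_mhz(symbols_per_cycle: int) -> int:
    """Core clock frequency implied by the number of symbols per cycle."""
    if symbols_per_cycle not in (1, 2):
        raise ValueError("only 1 or 2 symbols per cycle are described")
    return SYMBOL_RATE_MHZ // symbols_per_cycle
```

A wider datapath at a lower clock generally eases timing closure, which is presumably why both options are offered.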
Enable Address Alignment (value Disabled)
Parameter Name : GLOB_ADDR_ALIGN_EN. Allows the application to enable address alignment. When enabled, the
core performs address alignment and generates the first and last byte enables based on the address and number of
bytes of the TLP requested from the client interface. NOTE: (This note applies to all Switch Applications and to
EP, RC, or Dual Mode Applications that don't have the CX_ECRC_EN macro defined): For Switch Applications
and other Applications where CX_ECRC_EN is not defined, this should normally be disabled. However, if the
Application requires this to be enabled, then the address alignment pin at the top level of the Application should
only be high for those TLPs without ECRC. For TLPs with ECRC that are being transmitted by the Application,
the address alignment pin should be de-asserted for that TLP.

Provide Control to Flip Physical RX/TX Lanes
Parameter Name : CX_LANE_FLIP_CTRL_EN. Provides control allowing physical RX/TX lanes to be flipped. When
this feature is enabled, two pins are provided to control RX and TX separately.
input rx_lane_flip_en: 0 - requires the RX LSB lane (i.e. lane0) to be physically present. 1 - enables flipping of the
RX MSB lane to lane0.
input tx_lane_flip_en: 0 - requires the TX LSB lane (i.e. lane0) to be physically present. 1 - enables flipping of the
TX MSB lane to lane0.

Number of Fast Training (NFTS) Sequences
Parameter Name : CX_NFTS. Specifies the number of Fast Training Sequences the core advertises during link
training. This is used to inform the link partner of the core's ability to recover synchronization after a low power
state. This number should come from the SerDes vendor. Legal values are in the range 1 - 255.

NFTS when using common clock
Parameter Name : CX_COMM_NFTS. Specifies the number of Fast Training Sequences the core advertises during
link training when common clock configuration is set. Legal values are in the range 1 - 255.
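
The effect of the lane flip pins described for CX_LANE_FLIP_CTRL_EN can be pictured with a small sketch. It assumes that flipping is a full reversal of lane order (so the MSB lane becomes lane0, the next lane becomes lane1, and so on); the text only states that the MSB lane is flipped to lane0, so the full-reversal behavior and the function name are assumptions.

```python
def logical_lane(physical_lane: int, num_lanes: int, flip_en: bool) -> int:
    """Map a physical lane index to the logical lane the core sees."""
    if not 0 <= physical_lane < num_lanes:
        raise ValueError("physical lane out of range")
    # With flipping enabled, the highest physical lane is presented as lane0.
    return num_lanes - 1 - physical_lane if flip_en else physical_lane
```

For an 8-lane link, `logical_lane(7, 8, True)` gives 0, matching the "MSB lane to lane0" wording; RX and TX would each apply the mapping under their own enable pin.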

Technology Speed
Parameter Name : CX_TECHNOLOGY. Specifies the speed of the technology relative to the clock frequency and
architecture. This parameter is used to enable additional pipelined stages internal to the core to trade off latency
and gates for ease of timing closure. Note: This is always SLOW for FPGAs.

Disable Lane Deskew (value 0x0)
Parameter Name : CX_DESKEW_DISABLE. Enable or disable lane deskew. This should be used with care.

Enable Lane (value 0x1)
Parameter Name : CX_LANE_REVERSE. Enable or disable core support for lane reversal.

Enable ASPM L1 Timeout
Parameter Name : CX_ASPM_TIMEOUT_ENTR_L1_EN. Enable or disable the ASPM L1 timer. When enabled, the
core will automatically go to L1 when the timer expires and the conditions in the PCIe Specification are met.

Maximum Tags Supported
Parameter Name : CX_MAX_TAG. Specifies the maximum number of tags supported by the core. Used to size the
completion look-up table and timeout RAM.

LBC Address Bus Width
Parameter Name : CX_LBC_EXT_AW. Specifies the width of the external Local Bus Controller (LBC) address bus.
Note: This feature is not applicable for RC.

Enable Diagnostic Bus
Parameter Name : DIAGNOSTIC_ENABLE. Enables routing of important diagnostic signals out of the top level.

Maximum Payload Size Supported
Parameter Name : CX_MAX_MTU. Specifies the maximum packet payload size supported by the core. This
parameter is used to size core memories.
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Default? 


Enable Optional Checks Parameter Name : ENABLE_OPTIONAL_CHECKS Adds optional
protocol checks including byte enable and flow control.


RAM Configuration


Use External RAMs Parameter Name : CX_RAM_AT_TOP_IF Specifies whether to use
external RAMs and include a top-level interface, or to use embedded
RAMs.
RAM Type Parameter Name : CX_RAM_TYPE Specifies the type of RAM
model to use.
RAM Timing Model Parameter Name : RAM_TIMING_MODEL Specifies whether to
use black box timing or a physical RAM timing model. If black box
timing is specified, black box RAMs will be used to synthesize, and
timing constraints for RAM interfaces will be derived from the
RAM*P_RD_ACCESS/RAM*P_ADDR_SU parameters. If physical
RAM timing is specified, it is expected to be provided by your
physical RAM model.
single port RAM Read Access Time [ps] Parameter Name : RAM1P_RD_ACCESS Specifies the single port
RAM read access time [used by synth. timing model]
single port RAM Address/Data Setup Time [ps] Parameter Name : RAM1P_ADDR_SU Specifies the single port
RAM data setup [used by synth. timing model]
dual port RAM Read Access Time [ps] Parameter Name : RAM2P_RD_ACCESS Specifies the dual port
RAM read access time [used by synth. timing model]
dual port RAM Address/Data Setup Time [ps] Parameter Name : RAM2P_ADDR_SU Specifies the dual port RAM
data setup [used by synth. timing model]


Transmit Configuration
Default?
Y


Include 3rd Client Interface Parameter Name : CLIENT2_POPULATED Determines whether to
include top-level ports for the optional third application transmit
client interface (XALI2).


Block Client 0 Interface Parameter Name : CX_CLIENT0_BLOCK_NEW_TLP This is
designed to allow the customer to select whether or not to allow the XADM
arbiter to block the client0 interface. When PMC is enabled with L1 and
L2, L3, there will be conditions under which new TLPs should be blocked,
but completions always need to go out. Therefore, if the customer
combines completions and new TLP requests on the
client0 interface, it needs to set this value to 0 and take over
the blocking function by monitoring the output signal
pm_xtlh_block_tlp. Note: If the core LBC is used or one client interface is
used for completions, then these block parameters should be set
accordingly. For example, if the client0 interface is used for
completions, then the parameter for client0 should be set to '0' so
the XADM arbiter will not block this interface.
Description


Block Client 1 Interface Parameter Name : CX_CLIENT1_BLOCK_NEW_TLP This is
designed to allow the customer to select whether or not to allow the XADM
arbiter to block the client1 interface. When PMC is enabled with L1 and
L2, L3, there will be conditions under which new TLPs should be blocked,
but completions always need to go out. Therefore, if the customer
combines completions and new TLP requests on the
client1 interface, it needs to set this value to 0 and take over
the blocking function by monitoring the output signal
pm_xtlh_block_tlp. Note: If the core LBC is used or one client interface is
used for completions, then these block parameters should be set
accordingly. For example, if the client1 interface is used for
completions, then the parameter for client1 should be set to '0' so
the XADM arbiter will not block this interface.

Block Client 2 Interface Parameter Name : CX_CLIENT2_BLOCK_NEW_TLP This is
designed to allow the customer to select whether or not to allow the XADM
arbiter to block the client2 interface. When PMC is enabled with L1 and
L2, L3, there will be conditions under which new TLPs should be blocked,
but completions always need to go out. Therefore, if the customer
combines completions and new TLP requests on the
client2 interface, it needs to set this value to 0 and take over
the blocking function by monitoring the output signal
pm_xtlh_block_tlp. Note: If the core LBC is used or one client interface is
used for completions, then these block parameters should be set
accordingly. For example, if the client2 interface is used for
completions, then the parameter for client2 should be set to '0' so
the XADM arbiter will not block this interface.
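
As a rough illustration of the application-side blocking described for the Block Client parameters, the sketch below models an application that shares one client interface between completions and new requests, so the block parameter is set to 0 and the application watches pm_xtlh_block_tlp itself. The function name may_submit and the string TLP kinds are hypothetical; only the signal name comes from the text above.

```python
def may_submit(tlp_kind: str, pm_xtlh_block_tlp: bool) -> bool:
    """Return True if the application may hand this TLP to the client interface.

    tlp_kind is "completion" or "request" (hypothetical labels).
    pm_xtlh_block_tlp mirrors the core output signal of the same name.
    """
    if tlp_kind == "completion":
        # Completions always need to go out, even while blocking is requested.
        return True
    # New requests are held back while the core asks for blocking.
    return not pm_xtlh_block_tlp
```

In real hardware this decision would be made in application logic on a cycle-by-cycle basis; the function only captures the rule.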

Populate ports for available credit buses Parameter Name : XADM_CRD_EN This parameter enables the
population of output ports for application monitoring of run-time
available credit information for the VCn buses:
xadm_ph_cdts [NVC*8-1:0] : available VC0 - VCn header posted credits
xadm_nph_cdts [NVC*8-1:0] : available VC0 - VCn header non-posted credits
xadm_cplh_cdts [NVC*8-1:0] : available VC0 - VCn header completion credits
xadm_pd_cdts [NVC*12-1:0] : available VC0 - VCn data posted credits
xadm_npd_cdts [NVC*12-1:0] : available VC0 - VCn data non-posted credits
xadm_cpld_cdts [NVC*12-1:0] : available VC0 - VCn data completion credits
Information for lower-order VCs is presented on the lower-order bits.
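
The packed layout of the credit buses can be illustrated with a short sketch. It assumes only what the row above states: header buses carry 8 bits per VC, data buses carry 12 bits per VC, and lower-order VCs sit in the lower-order bits. The function and constant names are hypothetical.

```python
HDR_CREDIT_BITS = 8    # per-VC field width on the header credit buses
DATA_CREDIT_BITS = 12  # per-VC field width on the data credit buses


def unpack_credits(bus_value: int, nvc: int, field_bits: int) -> list:
    """Split a packed credit bus into one value per VC, VC0 first."""
    mask = (1 << field_bits) - 1
    return [(bus_value >> (vc * field_bits)) & mask for vc in range(nvc)]
```

For example, a two-VC header bus value of 0x0A05 would unpack to 5 credits for VC0 and 10 credits for VC1.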


Transmit Arbitration 


Value Default?


Transmit Arbitration Method | 1 Parameter Name : CX_XADM_ARB_MODE Transmit Arbitration
Method. Client-Based: Provides Round Robin Arbitration Priority
among transmit clients. Strict Pri.: Provides Strict Arbitration
Priority among transmit clients. Client 0 has the lowest priority.
VC-Based: (available 5/2005) Provides VC-based Arbitration
Priority across 2 VC classes, LPVC/HPVC. LPVC groups can be
programmed to render Weighted Round Robin or Round Robin
Priority. HPVC groups provide Strict Priority Arbitration, with
priority toward highest VIDs.

Client Interface TLP pullback feature Parameter Name : CLIENT_PULLBACK When enabled, the client
interfaces are allowed to cancel a TLP currently submitted for
transmission.


Disabled
Enable LPVC WRR Weights Writable Parameter Name : CX_LPVC_WRR_WEIGHT_WRITABLE Enable
LPVC Weighted Round Robin Weights registers to be writable
through DBI


VC ID #0 Weight | 0xf Parameter Name : LPVC_WRR_WEIGHT_VC0 WRR Weighting for VC ID #0 | Y
VC ID #1 Weight | 0x0 Parameter Name : LPVC_WRR_WEIGHT_VC1 WRR Weighting for VC ID #1 | Y
VC ID #2 Weight | 0x0 Parameter Name : LPVC_WRR_WEIGHT_VC2 WRR Weighting for VC ID #2 | Y
VC ID #3 Weight | 0x0 Parameter Name : LPVC_WRR_WEIGHT_VC3 WRR Weighting for VC ID #3 | Y
VC ID #4 Weight | 0x0 Parameter Name : LPVC_WRR_WEIGHT_VC4 WRR Weighting for VC ID #4 | Y
VC ID #5 Weight | 0x0 Parameter Name : LPVC_WRR_WEIGHT_VC5 WRR Weighting for VC ID #5 | Y
VC ID #6 Weight | 0x0 Parameter Name : LPVC_WRR_WEIGHT_VC6 WRR Weighting for VC ID #6 | Y
VC ID #7 Weight | 0x0 Parameter Name : LPVC_WRR_WEIGHT_VC7 WRR Weighting for VC ID #7 | Y


XADM Posted


Disabled


Special Posted TLP Handling Parameter Name : SPECIAL_MAX_P_CRD_ENABLE This
parameter enables the user to specify the credits that must be accumulated
before the core will transmit Posted TLPs. Note: This option cannot be
selected if Compare Posted Credit is selected.


Posted TLP Credit Threshold | 32 Parameter Name : SPECIAL_MAX_P_CRD This parameter defines
the actual amount of Posted TLP credits the core must accumulate
before transmitting a posted TLP | Y Y


Compare Posted Credit Parameter Name : P_LEN_CMP_ENABLE This parameter enables
the core to compare the requested posted payload length against the
accumulated credits before transmission. Note: This option cannot
be selected if Special Posted TLP Handling is selected.
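
The two mutually exclusive posted-credit options can be contrasted with a small model. This is a sketch under stated assumptions, not the core's implementation: it assumes credits are counted in data credits of 128 bits (16 bytes) each, and that with neither option selected no extra gating is applied. All function and argument names are hypothetical.

```python
CREDIT_BYTES = 16  # assumption: one data credit = 128 bits of payload


def credits_needed(payload_bytes: int) -> int:
    """Data credits required for a payload, rounded up to whole credits."""
    return -(-payload_bytes // CREDIT_BYTES)


def can_send_posted(available_credits: int, payload_bytes: int, *,
                    special_enable: bool = False,
                    special_threshold: int = 32,
                    len_cmp_enable: bool = False) -> bool:
    """Model the Special Posted TLP Handling and Compare Posted Credit options."""
    if special_enable and len_cmp_enable:
        # The text states the two options cannot be selected together.
        raise ValueError("options are mutually exclusive")
    if special_enable:
        # Wait until a fixed credit threshold has accumulated.
        return available_credits >= special_threshold
    if len_cmp_enable:
        # Wait until enough credits exist for this particular payload.
        return available_credits >= credits_needed(payload_bytes)
    return True  # assumption: no additional gating when neither is selected
```

The threshold form trades some latency for simplicity, while the length-compare form releases each TLP as soon as its own payload is covered.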


Transmit Completion 


Disabled 


Special Completion Handling Parameter Name : SPECIAL_MAX_CPL_CRD_ENABLE This
parameter enables the user to specify the credits that must be accumulated
before the core will transmit completions. Note: This option cannot
be selected if Compare Completion Credit is selected.


Completion Credit Threshold | 32 Parameter Name : SPECIAL_MAX_CPL_CRD This parameter
defines the actual amount of completion credits the core must
accumulate before transmitting a completion | Y Y
Compare Completion Credit Parameter Name : CPL_LEN_CMP_ENABLE This parameter
enables the core to compare the requested completion length against
the accumulated credits before transmission. Note: This option
cannot be selected if Special Completion Handling is selected.


Common Register Configuration 


Application Interface Options
Disabled


Configuration Upper Limit Parameter Name : CONFIG_LIMIT Upper limit of internally
handled Configuration requests. Any access to a configuration register
above this address will go to the TRGT1 interface.

Default Target Interface Parameter Name : DEFAULT_TARGET Target Interface
Destination for received TLPs which are Unsupported Requests.
Note: This feature is not applicable for RC.

Target CPL LUT Enable Parameter Name : TRGT_CPL_LUT_EN Let the core calculate the
correct byte count for the CPL of an incoming MemRd. This feature is
available only if the target 1 interface is included. Note: This feature is
not available for Switch.


Maximum Remote Tags Supported Parameter Name : CX_REMOTE_MAX_TAG Specifies the
maximum number of tags tracked in the Target Completion LUT.
Used to size the target completion look-up table and timeout RAM. | Y Y
RADM CPL LUT STORE BYTE COUNT Parameter Name : RADM_CPL_LUT_STORE_BYTE_CNT Store
the byte count in the RADM completion LUT.
Client Data/Address Bus Parity Protection Parameter Name : CX_CLIENT_PAR_MODE Select client
address/data parity mode.


Application Par Error Out Enable Parameter Name : APP_PAR_ERR_OUT_EN Allow application to
monitor parity errors from core RAMs.


Application Return CRD Enable Parameter Name : APP_RETURN_CRD_EN Allow application to
directly control credit returns for each packet type.


Port Logic Register 


Disabled 


Default Link Number Parameter Name : DEFAULT_LINK_NUM Default Link Number
value that the EP Core advertises to the Link partner. Valid values
are 0-255. (Only in RC/SW_DOWN mode)


Default ACK Frequency | 0x0 Parameter Name : DEFAULT_ACK_FREQUENCY The EP Core
accumulates the number of pending Acks specified here (up to 255)
before sending an Ack. | Y


Default Replay Timer Adjustment Parameter Name : DEFAULT_REPLAY_ADJ Default replay timer
adjustment. Each value increases the replay timer by 64.


Default L1 Entry Latency | 0x2 Parameter Name : DEFAULT_L1_ENTR_LATENCY L1 Entrance
Latency | Y
Default L0s Entry Latency | 0x3 Parameter Name : DEFAULT_L0S_ENTR_LATENCY L0s Entrance
Latency | Y


MSI/MSI-X 


Disabled 


MSI Capability | 0x1 Parameter Name : MSI_CAP_ENABLE MSI Capability structure | Y
Enable 64-bit MSI | 0x1 Parameter Name : MSI_64_EN 64-bit address MSI enable | Y Y
Default Multiple MSI Capability | 0x0 Parameter Name : DEFAULT_MULTI_MSI_CAPABLE Indicates
that multiple Message mode is enabled by system software. The
number of Messages enabled must be less than or equal to the
Multiple Message Capable value. | Y


MSI-X Capability Parameter Name : MSIX_CAP_ENABLE MSI-X Capability enable
L0s Exit Latency 0x3


L0s Exit Latency 0x3
(common clk)

L1 Exit Latency
L1 Exit Latency
(common clk)
Port Number


Use Platform 


Reference Clock 


Physical Slot 
Slot Power Limit 
Slot Power Limit 
Value 

Slot is Hot-Plug 
Capable 


Slot Support 
Hot-Plug Surprise 


0x1


Disable Hot-Plug 
Software 
Notification 
Electro-mechanical 
Interlock 
Implemented 

Slot Power 

Slot Attention 
Indicator Present 


Slot MRL Sensor 
Present 


Slot Power
Controller Present
Slot Attention 0
Button Present


0x6
0x0
0x0
0x0
0xf
0x1
0x0
0x0
0x1
0x1
0x0
0x1

0x0


PCIe Extended Capabilities 


Support Advanced | 0x1
Error Reporting 


Virtual Channel Support Parameter Name : VC_ENABLE Virtual Channel Capability enable
Description Disabled
Parameter Name : DEFAULT_L0S_EXIT_LATENCY L0s Exit
Latency


Parameter Name : DEFAULT_COMM_L0S_EXIT_LATENCY L0s
Exit Latency when using common clock


Parameter Name : DEFAULT_L1_EXIT_LATENCY L1 Exit
Latency
Parameter Name : DEFAULT_COMM_L1_EXIT_LATENCY L1
Exit Latency when using common clock

Parameter Name : PORT_NUM PCIe Port number for the given
PCIe link
Parameter Name : SLOT_CLK_CONFIG Slot Clock Configuration.
Indicates that the component uses the same physical reference clock
that the platform provides on the connector.
Parameter Name : SLOT_PHY_SLOT_NUM Physical Slot Number


Parameter Name : SET_SLOT_PWR_LIMIT_SCALE Slot Power
Limit Scale - Specifies the scale used for the Slot Power Limit Value
Parameter Name : SET_SLOT_PWR_LIMIT_VAL Slot Power Limit
Value - Upper limit of power supplied by slot

Parameter Name : SLOT_HP_CAPABLE When set, indicates that
this slot is capable of supporting Hot-Plug operations

Parameter Name : SLOT_HP_SURPRISE When set, indicates that a
device present in this slot might be removed from the system
without any prior notification

Parameter Name : SLOT_NO_CC_SUPPORT When set, indicates
that this slot does not generate software notification when an issued
command is completed by the Hot-Plug Controller

Parameter Name : SLOT_EML_PRESENT When set, indicates
that an Electromechanical Interlock is implemented on the chassis
for this slot.
Parameter Name : SLOT_PWR_IND_PRESENT When set, indicates
that a Power Indicator is implemented on the chassis for this slot.
Parameter Name : SLOT_ATTEN_IND_PRESENT When set,
indicates that an Attention Indicator is implemented on the chassis
for this slot.
Parameter Name : SLOT_MRL_SENSOR_PRESENT When set,
indicates that an MRL Sensor is implemented on the chassis for this
slot.
Parameter Name : SLOT_PWR_CTRL_PRESENT When set,
indicates that a Power Controller is implemented for this slot.
Parameter Name : SLOT_ATTEN_BUTTON_PRESENT When set,
indicates that an Attention Button is implemented on the chassis
for this slot.


Description Disabled
Parameter Name : AER_ENABLE Advanced Error Reporting
Capability enable


Y
Parameter | Default | Disabled


Device Serial Number Capability Parameter Name : SERIAL_CAP_ENABLE Device Serial Number
Capability enable


Device Serial Number (1st DW) Parameter Name : DEFAULT_SN_DW1 Specifies the first 32-bit
device serial number | Y
Device Serial Number (2nd DW) Parameter Name : DEFAULT_SN_DW2 Specifies the second 32-bit
device serial number | Y


Vital Product Data (VPD) 


Disabled
VPD Capability | 0x0 Parameter Name : VPD_CAP_ENABLE Vital Product Data (VPD)
Capability structure enable | Y


Virtual Channel Capability 


Default?


VC Arbitration Capability Parameter Name : DEFAULT_VC_ARB_32 Types of VC
Arbitration supported by the device for the LPVC group:
bit 0 - Weighted Round Robin arbitration with 16 phases
bit 1 - Weighted Round Robin arbitration with 32 phases
bit 2 - Weighted Round Robin arbitration with 64 phases
bit 3 - Weighted Round Robin arbitration with 128 phases
bits 4-7 - Reserved

Low Priority Extended VC Count Parameter Name : DEFAULT_LOW_PRI_EXT_VC_CNT Indicates
the number of (extended) VCs, in addition to the default VC, belonging
to the LPVC group that has the lowest priority with respect to
other VC resources in a strict-priority VC arbitration.
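
The bit assignments listed for the VC Arbitration Capability value can be decoded as in the sketch below. It follows the bit meanings exactly as this table lists them (bits 0 to 3 select 16, 32, 64, and 128 phases); the function name is hypothetical, and ignoring the reserved bits is an assumption of the sketch.

```python
# Bit position -> number of WRR phases, as listed in the table above.
WRR_PHASES_BY_BIT = {0: 16, 1: 32, 2: 64, 3: 128}


def supported_wrr_phases(vc_arb_cap: int) -> list:
    """Return the WRR phase counts advertised by a capability value.

    Bits 4-7 are reserved and are ignored here (an assumption of this sketch).
    """
    return [phases for bit, phases in WRR_PHASES_BY_BIT.items()
            if vc_arb_cap & (1 << bit)]
```

A value of 0b0101, for instance, would advertise 16-phase and 64-phase arbitration.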


Function Configuration 


Function 0 Configuration 


Function 0 -> PCI Express Capability 


Disabled’ 
Y 


PCIe Capabilities Interrupt Message Number Parameter Name : PCIE_CAP_INT_MSG_NUM_0 This register
indicates which MSI/MSI-X vector is used for the interrupt message
generated in association with the status bits in either the Slot
Status register

Clock PM Support Parameter Name : DEFAULT_CLK_PM_CAP_0 When set, indicates
that the component tolerates the removal of any ref clk when the
link is in the L1 and L2/3 ready states.


Is Port Connected to Slot Parameter Name : SLOT_IMPLEMENTED_0 When set, indicates
that the PCI Express Link associated with this Port is connected to
a slot


Indicates the maximum supported size of the Tag field as a
Requester and the ability to accept requests with an 8-bit tag.
Should only be set when CX_REMOTE_MAX_TAG is set to 256
Parameter Name : DEFAULT_ATT_BUTT_PRE_0 When set,
indicates that an Attention Button is present

Parameter Name : DEFAULT_ATT_IND_PRE_0 When set, indicates
that an Attention Indicator is present


Support


Add Support For 0x0
Add Support For 0x1
Parameter | Default? | Disabled


Add Support For Parameter Name : DEFAULT_PWR_IND_PRE_0 When set,
indicates that a Power Indicator is present
Support No-Snoop Parameter Name : DEFAULT_NO_SNOOP_SUPPORTED_0 When
set, indicates that the device is permitted to set the No Snoop bit in
the Requester Attributes of transactions it initiates that do not
require hardware enforced cache coherency | Y


Active State Link | 0x3 Parameter Name : AS_LINK_PM_SUPT_0 Active State Power | Y
Enable Root RCB | 0x0 Parameter Name : ROOT_RCB_0 Indicates the RCB value for the | Y


Function 0 -> MSI-X Register Configuration 


Disabled
Y


MSIX Table Size | 0 Parameter Name : MSIX_TABLE_SIZE_0 MSI-X Table Size -
Encoded as (Table Size - 1).
MSIX Table BIR | 0x0 Parameter Name : MSIX_TABLE_BIR_0 Table BAR Indicator
Register (BIR). Indicates which BAR is used to map the MSI-X
Table into memory space | Y
MSIX Table Offset Parameter Name : MSIX_TABLE_OFFSET_0 Table Offset - Base
address of the MSI-X Table, as an offset from the base address of
the BAR indicated by the table BIR bits. | Y Y
MSIX PBA BIR Parameter Name : MSIX_PBA_BIR_0 Pending Bit Array (PBA)
BIR. Indicates which BAR is used to map the MSI-X PBA into
memory space | Y
MSIX PBA Offset Parameter Name : MSIX_PBA_OFFSET_0 PBA Offset - Base
address of the MSI-X PBA, as an offset from the base address of the
BAR indicated by the PBA BIR bits. | Y Y
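
The (Table Size - 1) encoding and the BIR-plus-offset addressing described above can be sketched as follows. The helper names and the bar_bases mapping are hypothetical; the 2048-entry upper bound reflects the 11-bit Table Size field of the standard MSI-X capability rather than anything stated in this table.

```python
def encode_msix_table_size(num_entries: int) -> int:
    """Encode an MSI-X table entry count as (Table Size - 1)."""
    if not 1 <= num_entries <= 2048:
        raise ValueError("MSI-X tables hold between 1 and 2048 entries")
    return num_entries - 1


def msix_structure_address(bar_bases: dict, bir: int, offset: int) -> int:
    """Address of the MSI-X Table or PBA: base of the BAR selected by BIR plus offset."""
    return bar_bases[bir] + offset
```

With the default value 0 shown for the table size, the encoding corresponds to a single-entry table.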


Function 0 -> Advanced Error Register Configuration 


Parameter | Default? | Disabled


Default ECRC Check Capability Parameter Name : DEFAULT_ECRC_CHK_CAP_0 ECRC Check Capability | Y
Default ECRC Generation Capability Parameter Name : DEFAULT_ECRC_GEN_CAP_0 ECRC
Generation Capability | Y
Advanced Error Interrupt Message Number Parameter Name : AER_INT_MSG_NUM_0 This register must
indicate which MSI/MSI-X vector is used for the interrupt message
generated in association with any of the status bits of this capability


Function 0 -> Power Management Register Configuration 


Disabled
PME Support | 0x1b Parameter Name : PME_SUPPORT_0 5-bit field indicating the
power states in which the device may generate a PME. | Y


Parameter Name : D1_SUPPORT_0 Supports the D1 PM state


Parameter Name : D2_SUPPORT_0 Supports the D2 PM state


Device Specific Initialization | 0x0 Parameter Name : DEV_SPEC_INIT_0 Device Specific Initialization


Auxiliary Current | 0x7 Parameter Name : AUX_CURRENT_0 Auxiliary Current
requirement | Y
Disabled


No Reset on D3hot->D0 Transition Parameter Name : DEFAULT_NO_SOFT_RESET_0 When set, it
indicates that this device, when transitioning from D3hot to D0
because of power state commands, does not perform an internal reset. | Y
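
The 5-bit PME Support field can be read bit by bit. The sketch below assumes the standard PCI Power Management ordering of that field (bit 0 = D0, bit 1 = D1, bit 2 = D2, bit 3 = D3hot, bit 4 = D3cold), which this table does not spell out; the function name is hypothetical.

```python
# Assumed bit order of the PME_Support field, lowest bit first.
PME_STATES = ("D0", "D1", "D2", "D3hot", "D3cold")


def pme_capable_states(pme_support: int) -> list:
    """List the power states from which a PME may be generated."""
    return [state for bit, state in enumerate(PME_STATES)
            if pme_support & (1 << bit)]
```

Under that assumption, the value 0x1b shown above would mean PME generation from D0, D1, D3hot, and D3cold, but not from D2.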


Function 0 -> PCI Register Configuration 


Description Disabled


Device Identification Number Parameter Name : CX_DEVICE_ID_0 Specifies the 16-bit device
identification number for the function.
Vendor Identification Number | 0x19b2 Parameter Name : CX_VENDOR_ID_0 Specifies the 16-bit vendor
identification number for the function. This value is controlled by the
PCI SIG.


Device Revision | 0x1 Parameter Name : CX_REVISION_ID_0 Specifies the 8-bit revision | Y
Parameter Name : BASE_CLASS_CODE_0 Class code
Parameter Name : SUB_CLASS_CODE_0 Sub-class code


Programming Interface Code | 0x0 Parameter Name : IF_CODE_0 Programming Interface code | Y
IO Address Decode | 0x1 Parameter Name : IO_DECODE_32_0 IO Addressing (Type1-Only)
**NOTE** Should not appear for EP
Memory Address Parameter Name : MEM_DECODE_64_0 Memory Addressing
(Type1-Only) **NOTE** Should not appear for EP
Enable ROM BAR Parameter Name : ROM_BAR_ENABLED_0 ROM BAR Enable
ROM BAR Mask | 0xffff Parameter Name : ROM_MASK_0 ROM BAR Mask ex: 32'hFFFF
= BAR size of 2^16. Set to all F's to disable | Y Y
Allow Reprogramming of ROM BAR Mask Parameter Name : ROM_MASK_WRITABLE_0 When set, enables
dynamic changing of ROM BAR Mask through DBI | Y
Specify ROM BAR Target Interface Parameter Name : ROM_FUNC0_TARGET_MAP Destination of
requests matching the ROM BAR. Note: This feature is not applicable
for RC. | Y
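
The mask examples in this section (a mask of FFFF giving a BAR size of 2 to the 16th power) follow from a mask of n low-order ones describing a region of 2**n bytes. A minimal sketch of that relationship, with a hypothetical function name:

```python
def bar_size_from_mask(mask: int) -> int:
    """Return the BAR size in bytes implied by a mask of contiguous low-order ones."""
    size = mask + 1
    if mask < 0 or size & (size - 1):
        # A valid mask has the form 2**n - 1, so mask + 1 is a power of two.
        raise ValueError("mask must be of the form 2**n - 1")
    return size
```

For instance, bar_size_from_mask(0xFFFF) gives 65536 bytes, matching the ROM BAR example.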


Function 0 -> BAR_O / BAR_1 


Disable
Parameter Name : BAR0_ENABLED_0 BAR0 Enable
BAR_0 is Memory Parameter Name : MEM0_SPACE_DECODER_0 BAR0 Memory
Space Indicator. When set, indicates IO space | Y Y
BAR_0 is Prefetchable Parameter Name : PREFETCHABLE0_0 BAR0 Memory
Prefetchable. When set, indicates BAR0 Memory BAR is a
prefetchable BAR | Y Y
Parameter Name : BAR0_TYPE_0 BAR0 Type - 32 or 64bit | Y
Allow Reprogramming of BAR_0 Mask Parameter Name : BAR0_MASK_WRITABLE_0 When set, enables
dynamic changing of BAR0 Mask through DBI | Y Y
BAR_0 Mask | 0xfffff Parameter Name : BAR0_MASK_0 BAR0 Mask ex: 64'hFFFFF =
BAR size of 2^20. | Y Y
Specify Target Interface for BAR_0 Parameter Name : MEM_FUNC0_BAR0_TARGET_MAP
1 - target 1 is the intended destination for requests matching function 0 / BAR 0
0 - target 0 is the intended destination for requests matching function 0 / BAR 0
Note: This feature is not applicable for RC. | Y Y
Enable BAR_1 | 0x0 Parameter Name : BAR1_ENABLED_0 BAR1 Enable | Y Y
Disable
BAR_1 is Memory | 0x0 Parameter Name : MEM1_SPACE_DECODER_0 BAR1 Memory
Space Indicator. When set, indicates IO space | Y Y
BAR_1 is Prefetchable | 0x0 Parameter Name : PREFETCHABLE1_0 BAR1 Memory
Prefetchable. When set, indicates BAR1 Memory BAR is a
prefetchable BAR | Y Y
Y
Y


Allow Reprogramming of BAR_1 Mask Parameter Name : BAR1_MASK_WRITABLE_0 When set, enables
dynamic changing of BAR1 Mask through DBI | Y
BAR_1 Mask | 0xfffffff Parameter Name : BAR1_MASK_0 BAR1 Mask ex: 64'hFFFFF =
BAR size of 2^20. | Y
Specify Target Interface for BAR_1 | 0x1 Parameter Name : MEM_FUNC0_BAR1_TARGET_MAP
1 - target 1 is the intended destination for requests matching function 0 / BAR 1
0 - target 0 is the intended destination for requests matching function 0 / BAR 1
Note: This feature is not applicable for RC. | Y


Function 1 


Disabled
Extended Tag Support Parameter Name : DEFAULT_EXT_TAG_FIELD_SUPPORTED_1
Indicates the maximum supported size of the Tag field as a
Requester and the ability to accept requests with an 8-bit tag.
Should only be set when CX_REMOTE_MAX_TAG is set to 256 | Y


Function 1 -> PCI Express Capability: 


Function 2 


Default? | Disabled 


Extended Tag Support Parameter Name : DEFAULT_EXT_TAG_FIELD_SUPPORTED_2
Indicates the maximum supported size of the Tag field as a
Requester and the ability to accept requests with an 8-bit tag.
Should only be set when CX_REMOTE_MAX_TAG is set to 256 | Y


Function 2 -> PCI Express Capability: 


Function 3 


Disabled 


Extended Tag Support Parameter Name : DEFAULT_EXT_TAG_FIELD_SUPPORTED_3
Indicates the maximum supported size of the Tag field as a
Requester and the ability to accept requests with an 8-bit tag.
Should only be set when CX_REMOTE_MAX_TAG is set to 256 | Y Y


Function 3 -> PCI Express Capability: 


Function 4 
Default? | Disabled


Extended Tag Support Parameter Name : DEFAULT_EXT_TAG_FIELD_SUPPORTED_4
Indicates the maximum supported size of the Tag field as a
Requester and the ability to accept requests with an 8-bit tag.
Should only be set when CX_REMOTE_MAX_TAG is set to 256 | Y


Function 4 -> PCI Express Capability: 


Function 5 


Disabled
Extended Tag Support Parameter Name : DEFAULT_EXT_TAG_FIELD_SUPPORTED_5
Indicates the maximum supported size of the Tag field as a
Requester and the ability to accept requests with an 8-bit tag.
Should only be set when CX_REMOTE_MAX_TAG is set to 256 | Y


Function 5 -> PCI Express Capability: 


Function 6 


Disabled


Extended Tag Support Parameter Name : DEFAULT_EXT_TAG_FIELD_SUPPORTED_6
Indicates the maximum supported size of the Tag field as a
Requester and the ability to accept requests with an 8-bit tag.
Should only be set when CX_REMOTE_MAX_TAG is set to 256


Function 6 -> PCI Express Capability: 


Function 7 


Disabled 


Extended Tag Support Parameter Name : DEFAULT_EXT_TAG_FIELD_SUPPORTED_7
Indicates the maximum supported size of the Tag field as a
Requester and the ability to accept requests with an 8-bit tag.
Should only be set when CX_REMOTE_MAX_TAG is set to 256 | Y Y


Function 7 -> PCI Express Capability: 


Filter Configuration 


Default? 


FLT_Q_ADDR_WID Parameter Name : FLT_Q_ADDR_WIDTH Number of bits for Filter
field FLT_Q_ADDR | Y


Allow AER (UR/CA Error) for TLPs Destined for Target 1 | 0x0 Parameter Name : CX_MASK_UR_CA_4_TRGT1
1 - Allow AER (UR/CA error) for TLPs destined for Trgt1
0 - Suppress AER (UR/CA error) for TLPs destined for Trgt1

FLT Message Drop Parameter Name : FLT_DROP_MSG Controls whether
messages are passed along to the application or consumed by the
core.
Queuing & Buffer Configuration
Queue Depth Worksheet


Disabled 


Enable Auto Size of Retry Buffer Parameter Name : CX_RBUF_AUTOSIZE Switch ON / OFF
automatic retry buffer sizing. When ON, the retry buffer size is
derived from the Maximum Payload Size, the Link Width, and the
core latencies. The SOTBUF buffer size will be calculated from
these same criteria. When OFF, the retry buffer size must be
specified by the user by entering the sizes directly in RBUF depth
and SOTBUF depth.


MAC Tx Delay Parameter Name : CX_PHY_TX_DELAY_MAC Transmitter delay
(MAC) in clock cycles


PHY Tx Delay Parameter Name : CX_PHY_TX_DELAY_PHY Transmitter delay
(PHY) in clock cycles

MAC Rx Delay | 4 Parameter Name : CX_PHY_RX_DELAY_MAC Receiver delay
(MAC) in clock cycles


PHY Rx Delay Parameter Name : CX_PHY_RX_DELAY_PHY Receiver delay
(PHY) in clock cycles


Internal Delay / Link Partner Delay Parameter Name : CX_INTERNAL_DELAY The internal
processing delays for received TLPs and transmitted DLLPs. This
value is used to calculate Retry buffer and SOTBUF buffer sizes.


Y 
Y 
Y 


Default? 
Y 


Retry Buffer Depth | 215 Parameter Name : CX_RBUF_DEPTH Number of locations in
Retry Buffer Width Parameter Name : RBUF_WIDTH Width of Retry Buffer RAM | Y


Retry Buffer Configuration: 


Default? | Disabled 


Minimum SOT Depth Parameter Name : CX_SOTBUF_DEPTH Minimum number of
RAM entries per packet. The actual SOTBUF depth is adjusted to be at
least 32 and is rounded up to the next power of 2.


SOT Buffer Depth | 32 Parameter Name : SOTBUF_DEPTH Number of locations in | Y Y
SOT Buffer Width Parameter Name : SOTBUF_WIDTH Width of SOTBUF RAM
(number of address bits) | Y Y
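
The adjustment described for the SOT buffer depth (at least 32, then rounded up to the next power of 2) can be written out as a small calculation. The function name is hypothetical and the sketch covers only the rule stated above, not the automatic sizing from payload size and link width.

```python
def actual_sotbuf_depth(min_depth: int) -> int:
    """Apply the stated rule: at least 32, rounded up to a power of 2."""
    depth = max(min_depth, 32)
    # (depth - 1).bit_length() is the exponent of the next power of 2 >= depth.
    return 1 << (depth - 1).bit_length()
```

A requested minimum of 33 entries would therefore yield a 64-entry buffer, while any request of 32 or fewer yields 32.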


SOT Buffer Configuration: 


General Configuration 


Value Disabled’ 


May 14, 2014 799 Rev 51328 


SiCortex Confidential CHAPTER 13. PCI EXPRESS SUBSYSTEM 


Default? 


Specify Queue Mode Parameter Name : CX_RADMQ_MODE There are two queue modes
supported. Multi-Q mode: Queues are separated into
individual TLP queues. Single-Q mode: Queues that are not
bypassed are combined into a single header queue and a single
data queue. The Posted Queue is the 'host' queue used as the Single
Queue; therefore Single-Q mode is not supported if the posted queue is
bypassed. Segment Buffer: (available in an upcoming release)
Queues that are not bypassed are located on a single RAM but are
functionally treated as separate queues.
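
The one hard constraint stated for the queue modes (Single-Q mode requires a posted queue that is not bypassed) lends itself to a simple configuration check. The sketch below is illustrative only; the mode strings and the function name are hypothetical and do not correspond to actual parameter encodings.

```python
def check_queue_mode(radmq_mode: str, posted_qmode: str) -> None:
    """Raise if Single-Q mode is combined with a bypassed posted queue.

    radmq_mode: "multi" or "single" (hypothetical labels).
    posted_qmode: "bypass", "store_forward", or "cut_through" (hypothetical labels).
    """
    if radmq_mode == "single" and posted_qmode == "bypass":
        # The posted queue hosts the single queue, so it cannot be bypassed.
        raise ValueError("Single-Q mode needs a non-bypassed posted queue")
```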

Inhibit RAM read enable when segment empty Parameter Name : CX_RADM_ADDR_COMP Inhibits the RAM's
read enable when the read and write addresses are equal. Turning
this option off will improve timing but may not be supported by
some RAM implementations. NOTE: The core only requires that the
write data be written to the RAM in this situation. The read data is
not used and can be x's.


Receive VC Arbitration Parameter Name : CX_RADM_STRICT_VC_PRIORITY
Arbitration between VCs. If set to strict VC Priority, VC0 is lowest
priority, VC7 is highest.

Support Relaxed Ordering Parameter Name : RELAXED_ORDER_SUPPORT Relaxed Order
Support. When set, allows CPL types to go out of order.

Enable Support for Cut-Through Mode Parameter Name : CUT_THROUGH_INVOLVED

Enable Passing of ECRC Values to the Application Parameter Name : ECRC_ERR_PASS_THROUGH

Enable Dynamic FC Credit Adjustment | 1 Parameter Name : CX_DYNAMIC_FC_CREDIT


Enable Dynamic Q Depth Adjustment Parameter Name : CX_DYNAMIC_SEG_SIZE


PCIe Ordering Rules Support Parameter Name : CLUMP_SUPPORT PCIe Ordering Rules
support. This option enables support for the PCIe Ordering Rules
arbitration mode. If this option is not set, PCIe Ordering Rule
based Arbitration will not be available.


Segment Buffer Options: 


Description Default?


Posted Q Use Ordering FIFO Parameter Name : CX_RADMQ_P_NB_ORDER_LIST If Posted
TLP Queues are not bypassed, this parameter provides a switch to
control whether the Order FIFO affects Posted queue operations. If the
bit is set to 1, presentation of received posted TLPs is controlled by
the Order FIFO. If the bit is set to 0, presentation of received
posted TLPs will not be influenced by the Order FIFO.
Default?


Non-Posted Q Use Ordering FIFO Parameter Name : CX_RADMQ_NP_NB_ORDER_LIST If
Non-Posted TLP Queues are not bypassed, this parameter provides
a switch to control whether the Order FIFO affects Non-Posted queue
operations. If the bit is set to 1, presentation of received
Non-Posted TLPs is controlled by the Order FIFO. If the bit is set
to 0, presentation of received Non-Posted TLPs will not be
influenced by the Order FIFO.

Completion Q Use Ordering FIFO Parameter Name : CX_RADMQ_CPL_NB_ORDER_LIST If
Completion TLP Queues are not bypassed, this parameter provides
a switch to control whether the Order FIFO affects Completion queue
operations. If the bit is set to 1, presentation of received
Completion TLPs is controlled by the Order FIFO. If the bit is set
to 0, presentation of received Completion TLPs will not be
influenced by the Order FIFO.


Multi Queue Options: 


VC Configuration 
In Single Queue and Multi-queue mode these settings are for ALL VC’s 


Posted Advertised Credits 


Parameter | Default? | Disabled


Mode Parameter Name : RADM_P_QMODE_VC0 Posted TLP queue
type. There are three queue types available:
Bypass/Store-Forward/Cut-Through. Bypass: There is no Posted
receive queue in this mode; the application must be able to accept
all traffic, as back-pressure is disabled in this mode. Store-Forward:
P TLPs are stored into the queue, and an available TLP is
advertised only after the entire TLP is stored into the queue.
Cut-Through: P TLPs are stored into the queue and presented to the
application at the same time they are being stored into the queue. | Y


Hdr Parameter Name : RADM_PQ_HCRD_VC0 Specifies the # of
Posted Hdr Credits to Advertise.

Data | 105 Parameter Name : RADM_PQ_DCRD_VC0 Specifies the # of
Posted Data Credits to Advertise. One data credit = 128 bits of
data
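
Because one data credit corresponds to 128 bits, an advertised data credit count translates directly into buffer bytes. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
CREDIT_BITS = 128  # one data credit = 128 bits of data, as stated above


def data_credits_to_bytes(credits: int) -> int:
    """Convert an advertised data credit count to bytes of payload."""
    return credits * CREDIT_BITS // 8
```

The value of 105 shown in the Posted data row would thus correspond to 1680 bytes of posted payload.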


Non-Posted Advertised Credits 


Default?


Parameter Name : RADM_NP_QMODE_VC0 Non-Posted TLP
queue type. There are three queue types available:
Bypass/Store-Forward/Cut-Through. Bypass: There is no NP
receive queue in this mode; the application must be able to accept
all traffic, as back-pressure is disabled in this mode. Store-Forward:
NP TLPs are stored into the queue, and an available TLP
is advertised only after the entire TLP is stored into the queue.
Cut-Through: NP TLPs are stored into the queue and presented to the
application at the same time they are being stored into the queue.
Parameter Name : RADM_NPQ_HCRD_VC0 Specifies the # of
Non-Posted Hdr Credits to Advertise.
Disabled
Data | 16 Parameter Name : RADM_NPQ_DCRD_VC0 Specifies the # of
Non-Posted Data Credits to Advertise. One data credit = 128 bits
of data


Completion Advertised Credits 


Default? | Disabled 


Mode  Parameter Name : RADM_CPL_QMODE_VC0. Completion TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no CPL
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: CPL TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: CPL TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_CPLQ_HCRD_VC0. Specifies the number of
Completion Hdr Credits to advertise.

Data  Parameter Name : RADM_CPLQ_DCRD_VC0. Specifies the number of
Completion Data Credits to advertise. One data credit = 128 bits of data.
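The difference between the three queue types described above can be illustrated with a small model. This is a sketch under stated assumptions, not the core's implementation: the notion of a TLP arriving as a sequence of "beats", and every name below, are hypothetical.

```python
from enum import Enum


class QueueMode(Enum):
    BYPASS = "bypass"
    STORE_FORWARD = "store-forward"
    CUT_THROUGH = "cut-through"


def has_receive_queue(mode: QueueMode) -> bool:
    """Bypass has no receive queue; the other two modes buffer the TLP."""
    return mode is not QueueMode.BYPASS


def first_visible_beat(mode: QueueMode, tlp_beats: int) -> int:
    """0-based index of the arriving beat at which the application first sees the TLP.

    Store-Forward waits until the last beat has been stored; Bypass and
    Cut-Through expose the TLP as soon as its first beat arrives.
    """
    if tlp_beats < 1:
        raise ValueError("a TLP has at least one beat")
    if mode is QueueMode.STORE_FORWARD:
        return tlp_beats - 1
    return 0
```

In this model a four-beat TLP becomes visible on beat 3 in Store-Forward mode and on beat 0 in the other two modes; only Bypass leaves the application without a queue to absorb traffic.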


Additional VC 0 Options 


Disabled 


Receive Arbitration Between Types  Parameter Name : CX_RADM_ORDERING_RULES_VC0.
Arbitration between transaction types (P/NP/CPL). If set to strict
priority, P is higher than CPL, which is higher than NP. Otherwise, it is set
to follow the PCIe spec, Table 2-23 ordering rules.

Decouple Depth from Credit  Parameter Name : RADM_DEPTH_DECOUPLE_VC0. Selecting
this option allows RAM depths to be specified independently of
the advertised credits.
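The strict-priority setting described above (P over CPL over NP) can be sketched as a simple selection function. The names are hypothetical, and the alternative setting, which follows the PCIe ordering rules table, is not modeled here.

```python
STRICT_PRIORITY = ("P", "CPL", "NP")  # highest priority first, per the text


def pick_next(pending):
    """Return the highest-priority transaction type with a TLP pending, or None.

    `pending` maps a type name ("P", "NP", "CPL") to its count of waiting TLPs.
    """
    for tlp_type in STRICT_PRIORITY:
        if pending.get(tlp_type, 0) > 0:
            return tlp_type
    return None
```

With one Completion and two Non-Posted TLPs waiting, `pick_next({"CPL": 1, "NP": 2})` selects the Completion first.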


Posted Buffer Depth 


Disabled 


Hdr  Parameter Name : RADM_PQ_HDP_VC0. Specifies the depth of the
Posted Hdr Queue/RAM.
Data  211  Parameter Name : RADM_PQ_DDP_VC0. Specifies the depth of the
Posted Data Queue/RAM. | Y


Non-Posted Buffer Depth 


Disabled 


Hdr  Parameter Name : RADM_NPQ_HDP_VC0. Specifies the depth of
the Non-Posted Hdr Queue/RAM.

Data  33  Parameter Name : RADM_NPQ_DDP_VC0. Specifies the depth of
the Non-Posted Data Queue/RAM. | Y


Completion Buffer Depth

Parameter Name : RADM_CPLQ_HDP_VC0. Specifies the depth of
the Completion Hdr Queue/RAM.

Parameter Name : RADM_CPLQ_DDP_VC0. Specifies the depth of
the Completion Data Queue/RAM.




Posted Advertised Credits: 


Parameter | Value | Description | Disabled

Parameter Name : RADM_P_QMODE_VC1. Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no Posted
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: P TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: P TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_PQ_HCRD_VC1. Specifies the number of
Posted Hdr Credits to advertise.

Parameter Name : RADM_PQ_DCRD_VC1. Specifies the number of
Posted Data Credits to advertise. One data credit = 128 bits of data.


Description | Disabled

Parameter Name : RADM_NP_QMODE_VC1. Non-Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no NP
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: NP TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: NP TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_NPQ_HCRD_VC1. Specifies the number of
Non-Posted Hdr Credits to advertise.

Parameter Name : RADM_NPQ_DCRD_VC1. Specifies the number of
Non-Posted Data Credits to advertise. One data credit = 128 bits of data.


Non-Posted Advertised Credits: 


Default? | Disabled

Mode  Parameter Name : RADM_CPL_QMODE_VC1. Completion TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no CPL
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: CPL TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: CPL TLPs are stored into the queue and
presented to the application at the same time as they are being stored. | Y

Parameter Name : RADM_CPLQ_HCRD_VC1. Specifies the number of
Completion Hdr Credits to advertise. | Y

Data  Parameter Name : RADM_CPLQ_DCRD_VC1. Specifies the number of
Completion Data Credits to advertise. One data credit = 128 bits of data. | Y


Completion Advertised Credits: 


Disabled

Receive Arbitration Between Types  Parameter Name : CX_RADM_ORDERING_RULES_VC1.
Arbitration between transaction types (P/NP/CPL). If set to strict
priority, P is higher than CPL, which is higher than NP. Otherwise, it is set
to follow the PCIe spec, Table 2-23 ordering rules.

Decouple Depth from Credit  Parameter Name : RADM_DEPTH_DECOUPLE_VC1. Selecting
this option allows RAM depths to be specified independently of
the advertised credits.


Additional VC 1 Options: 


Disabled 


Hdr  Parameter Name : RADM_PQ_HDP_VC1. Specifies the depth of the
Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_PQ_DDP_VC1. Specifies the depth of the
Posted Data Queue/RAM. | Y Y


Posted Buffer Depth:


Disabled

Hdr  Parameter Name : RADM_NPQ_HDP_VC1. Specifies the depth of
the Non-Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_NPQ_DDP_VC1. Specifies the depth of
the Non-Posted Data Queue/RAM. | Y Y


Non-Posted Buffer Depth:


Disabled

Hdr  Parameter Name : RADM_CPLQ_HDP_VC1. Specifies the depth of
the Completion Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_CPLQ_DDP_VC1. Specifies the depth of
the Completion Data Queue/RAM. | Y


Completion Buffer Depth:


VC 2 


Description | Disabled

Parameter Name : RADM_P_QMODE_VC2. Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no Posted
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: P TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: P TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_PQ_HCRD_VC2. Specifies the number of
Posted Hdr Credits to advertise.

Parameter Name : RADM_PQ_DCRD_VC2. Specifies the number of
Posted Data Credits to advertise. One data credit = 128 bits of data.


Posted Advertised Credits: 


Description | Disabled

Mode  Parameter Name : RADM_NP_QMODE_VC2. Non-Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no NP
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: NP TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: NP TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_NPQ_HCRD_VC2. Specifies the number of
Non-Posted Hdr Credits to advertise.

Parameter Name : RADM_NPQ_DCRD_VC2. Specifies the number of
Non-Posted Data Credits to advertise. One data credit = 128 bits of data.


Non-Posted Advertised Credits: 


Disabled

Parameter Name : RADM_CPL_QMODE_VC2. Completion TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no CPL
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: CPL TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: CPL TLPs are stored into the queue and
presented to the application at the same time as they are being stored. | Y Y

Parameter Name : RADM_CPLQ_HCRD_VC2. Specifies the number of
Completion Hdr Credits to advertise. | Y Y

Data  Parameter Name : RADM_CPLQ_DCRD_VC2. Specifies the number of
Completion Data Credits to advertise. One data credit = 128 bits of data. | Y


Completion Advertised Credits: 


Default? | Disabled

Receive Arbitration Between Types  Parameter Name : CX_RADM_ORDERING_RULES_VC2.
Arbitration between transaction types (P/NP/CPL). If set to strict
priority, P is higher than CPL, which is higher than NP. Otherwise, it is set
to follow the PCIe spec, Table 2-23 ordering rules. | Y

Decouple Depth from Credit  Parameter Name : RADM_DEPTH_DECOUPLE_VC2. Selecting
this option allows RAM depths to be specified independently of
the advertised credits. | Y


Additional VC 2 Options: 


Disabled 


Hdr  Parameter Name : RADM_PQ_HDP_VC2. Specifies the depth of the
Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_PQ_DDP_VC2. Specifies the depth of the
Posted Data Queue/RAM. | Y Y


Posted Buffer Depth:


Disabled

Hdr  Parameter Name : RADM_NPQ_HDP_VC2. Specifies the depth of
the Non-Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_NPQ_DDP_VC2. Specifies the depth of
the Non-Posted Data Queue/RAM. | Y Y


Non-Posted Buffer Depth: 


Default? 


Hdr Parameter Name : RADM_CPLQ_HDP_VC2 Specifies the depth of 
the Completion Hdr Queue/RAM. 


Data Parameter Name : RADM_CPLQ_DDP_VC2 Specifies the depth of 
the Completion Data Queue/RAM. 


Completion Buffer Depth: 


VC 3 
Description | Disabled

Parameter Name : RADM_P_QMODE_VC3. Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no Posted
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: P TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: P TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_PQ_HCRD_VC3. Specifies the number of
Posted Hdr Credits to advertise.

Parameter Name : RADM_PQ_DCRD_VC3. Specifies the number of
Posted Data Credits to advertise. One data credit = 128 bits of data.


Description | Disabled

Parameter Name : RADM_NP_QMODE_VC3. Non-Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no NP
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: NP TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: NP TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_NPQ_HCRD_VC3. Specifies the number of
Non-Posted Hdr Credits to advertise.

Parameter Name : RADM_NPQ_DCRD_VC3. Specifies the number of
Non-Posted Data Credits to advertise. One data credit = 128 bits of data.


Non-Posted Advertised Credits: 


Default? | Disabled 


Parameter Name : RADM_CPL_QMODE_VC3. Completion TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no CPL
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: CPL TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: CPL TLPs are stored into the queue and
presented to the application at the same time as they are being stored. | Y

Parameter Name : RADM_CPLQ_HCRD_VC3. Specifies the number of
Completion Hdr Credits to advertise.

Parameter Name : RADM_CPLQ_DCRD_VC3. Specifies the number of
Completion Data Credits to advertise. One data credit = 128 bits of data.


Completion Advertised Credits:


Default? | Disabled

Receive Arbitration Between Types  Parameter Name : CX_RADM_ORDERING_RULES_VC3.
Arbitration between transaction types (P/NP/CPL). If set to strict
priority, P is higher than CPL, which is higher than NP. Otherwise, it is set
to follow the PCIe spec, Table 2-23 ordering rules. | Y

Decouple Depth from Credit  Parameter Name : RADM_DEPTH_DECOUPLE_VC3. Selecting
this option allows RAM depths to be specified independently of
the advertised credits. | Y


Additional VC 3 Options: 


Disabled 


Hdr  Parameter Name : RADM_PQ_HDP_VC3. Specifies the depth of the
Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_PQ_DDP_VC3. Specifies the depth of the
Posted Data Queue/RAM. | Y Y


Posted Buffer Depth:


Disabled

Hdr  Parameter Name : RADM_NPQ_HDP_VC3. Specifies the depth of
the Non-Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_NPQ_DDP_VC3. Specifies the depth of
the Non-Posted Data Queue/RAM. | Y Y


Non-Posted Buffer Depth: 


Parameter Name : RADM_CPLQ_HDP_VC3 Specifies the depth of 
the Completion Hdr Queue/RAM. 

Parameter Name : RADM_CPLQ_DDP_VC3 Specifies the depth of 
the Completion Data Queue/RAM. 


Completion Buffer Depth: 


VC 4 


Disabled 


Mode  0x1  Parameter Name : RADM_P_QMODE_VC4. Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no Posted
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: P TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: P TLPs are stored into the queue and
presented to the application at the same time as they are being stored. | Y Y

Parameter Name : RADM_PQ_HCRD_VC4. Specifies the number of
Posted Hdr Credits to advertise. | Y

Data  Parameter Name : RADM_PQ_DCRD_VC4. Specifies the number of
Posted Data Credits to advertise. One data credit = 128 bits of data.


Posted Advertised Credits: 


Default? 


Mode  Parameter Name : RADM_NP_QMODE_VC4. Non-Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no NP
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: NP TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: NP TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_NPQ_HCRD_VC4. Specifies the number of
Non-Posted Hdr Credits to advertise.

Data  Parameter Name : RADM_NPQ_DCRD_VC4. Specifies the number of
Non-Posted Data Credits to advertise. One data credit = 128 bits of data.


Non-Posted Advertised Credits: 


Default? 


Mode  Parameter Name : RADM_CPL_QMODE_VC4. Completion TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no CPL
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: CPL TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: CPL TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_CPLQ_HCRD_VC4. Specifies the number of
Completion Hdr Credits to advertise.

Data  Parameter Name : RADM_CPLQ_DCRD_VC4. Specifies the number of
Completion Data Credits to advertise. One data credit = 128 bits of data.


Completion Advertised Credits: 


Description 


Receive Arbitration Between Types  Parameter Name : CX_RADM_ORDERING_RULES_VC4.
Arbitration between transaction types (P/NP/CPL). If set to strict
priority, P is higher than CPL, which is higher than NP. Otherwise, it is set
to follow the PCIe spec, Table 2-23 ordering rules.

Decouple Depth from Credit  Parameter Name : RADM_DEPTH_DECOUPLE_VC4. Selecting
this option allows RAM depths to be specified independently of
the advertised credits.


Additional VC 4 Options:


Description | Disabled

Hdr  Parameter Name : RADM_PQ_HDP_VC4. Specifies the depth of the
Posted Hdr Queue/RAM. | Y

Data  Parameter Name : RADM_PQ_DDP_VC4. Specifies the depth of the
Posted Data Queue/RAM. | Y


Posted Buffer Depth: 


Parameter Value | Description 


Hdr Parameter Name : RADM_NPQ_HDP_VC4 Specifies the depth of 
the Non-Posted Hdr Queue/RAM. 


Data Parameter Name : RADM_NPQ_DDP_VC4 Specifies the depth of 
the Non-Posted Data Queue/RAM. 


Non-Posted Buffer Depth: 


Parameter | Description | Disabled

Parameter Name : RADM_CPLQ_HDP_VC4. Specifies the depth of
the Completion Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_CPLQ_DDP_VC4. Specifies the depth of
the Completion Data Queue/RAM. | Y


Completion Buffer Depth: 


VC 5 


Default? | Disabled

Mode  Parameter Name : RADM_P_QMODE_VC5. Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no Posted
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: P TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: P TLPs are stored into the queue and
presented to the application at the same time as they are being stored. | Y

Parameter Name : RADM_PQ_HCRD_VC5. Specifies the number of
Posted Hdr Credits to advertise. | Y

Parameter Name : RADM_PQ_DCRD_VC5. Specifies the number of
Posted Data Credits to advertise. One data credit = 128 bits of data. | Y


Posted Advertised Credits: 


Description | Disabled

Mode  0x1  Parameter Name : RADM_NP_QMODE_VC5. Non-Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no NP
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: NP TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: NP TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_NPQ_HCRD_VC5. Specifies the number of
Non-Posted Hdr Credits to advertise.

Data  Parameter Name : RADM_NPQ_DCRD_VC5. Specifies the number of
Non-Posted Data Credits to advertise. One data credit = 128 bits of data.


Non-Posted Advertised Credits: 


Disabled

Mode  0x1  Parameter Name : RADM_CPL_QMODE_VC5. Completion TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no CPL
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: CPL TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: CPL TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_CPLQ_HCRD_VC5. Specifies the number of
Completion Hdr Credits to advertise.

Data  Parameter Name : RADM_CPLQ_DCRD_VC5. Specifies the number of
Completion Data Credits to advertise. One data credit = 128 bits of data.


Completion Advertised Credits: 


Disabled

Receive Arbitration Between Types  Parameter Name : CX_RADM_ORDERING_RULES_VC5.
Arbitration between transaction types (P/NP/CPL). If set to strict
priority, P is higher than CPL, which is higher than NP. Otherwise, it is set
to follow the PCIe spec, Table 2-23 ordering rules. | Y

Decouple Depth from Credit  Parameter Name : RADM_DEPTH_DECOUPLE_VC5. Selecting
this option allows RAM depths to be specified independently of
the advertised credits. | Y


Additional VC 5 Options: 


Disabled 


Hdr  Parameter Name : RADM_PQ_HDP_VC5. Specifies the depth of the
Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_PQ_DDP_VC5. Specifies the depth of the
Posted Data Queue/RAM. | Y Y


Posted Buffer Depth:


Description | Disabled

Hdr  Parameter Name : RADM_NPQ_HDP_VC5. Specifies the depth of
the Non-Posted Hdr Queue/RAM. | Y

Data  Parameter Name : RADM_NPQ_DDP_VC5. Specifies the depth of
the Non-Posted Data Queue/RAM. | Y


Non-Posted Buffer Depth: 


Default? | Disabled 


Hdr  Parameter Name : RADM_CPLQ_HDP_VC5. Specifies the depth of
the Completion Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_CPLQ_DDP_VC5. Specifies the depth of
the Completion Data Queue/RAM. | Y Y


Completion Buffer Depth: 


VC 6 


Default?

Parameter Name : RADM_P_QMODE_VC6. Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no Posted
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: P TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: P TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_PQ_HCRD_VC6. Specifies the number of
Posted Hdr Credits to advertise.

Parameter Name : RADM_PQ_DCRD_VC6. Specifies the number of
Posted Data Credits to advertise. One data credit = 128 bits of data.


Posted Advertised Credits: 


Disabled

Parameter Name : RADM_NP_QMODE_VC6. Non-Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no NP
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: NP TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: NP TLPs are stored into the queue and
presented to the application at the same time as they are being stored. | Y

Parameter Name : RADM_NPQ_HCRD_VC6. Specifies the number of
Non-Posted Hdr Credits to advertise. | Y

Data  Parameter Name : RADM_NPQ_DCRD_VC6. Specifies the number of
Non-Posted Data Credits to advertise. One data credit = 128 bits of data. | Y


Non-Posted Advertised Credits: 


Default? 


Mode  Parameter Name : RADM_CPL_QMODE_VC6. Completion TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no CPL
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: CPL TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: CPL TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_CPLQ_HCRD_VC6. Specifies the number of
Completion Hdr Credits to advertise.

Data  Parameter Name : RADM_CPLQ_DCRD_VC6. Specifies the number of
Completion Data Credits to advertise. One data credit = 128 bits of data.


Completion Advertised Credits: 


Receive Arbitration Between Types  Parameter Name : CX_RADM_ORDERING_RULES_VC6.
Arbitration between transaction types (P/NP/CPL). If set to strict
priority, P is higher than CPL, which is higher than NP. Otherwise, it is set
to follow the PCIe spec, Table 2-23 ordering rules.

Decouple Depth from Credit  Parameter Name : RADM_DEPTH_DECOUPLE_VC6. Selecting
this option allows RAM depths to be specified independently of
the advertised credits.


Additional VC 6 Options: 


Disabled 


Hdr  Parameter Name : RADM_PQ_HDP_VC6. Specifies the depth of the
Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_PQ_DDP_VC6. Specifies the depth of the
Posted Data Queue/RAM. | Y Y


Posted Buffer Depth:


Disabled

Hdr  Parameter Name : RADM_NPQ_HDP_VC6. Specifies the depth of
the Non-Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_NPQ_DDP_VC6. Specifies the depth of
the Non-Posted Data Queue/RAM. | Y


Non-Posted Buffer Depth:

Parameter Name : RADM_CPLQ_HDP_VC6. Specifies the depth of
the Completion Hdr Queue/RAM.


Parameter Name : RADM_CPLQ_DDP_VC6 Specifies the depth of 
the Completion Data Queue/RAM. 


Completion Buffer Depth: 


VC 7 


Description | Disabled

Parameter Name : RADM_P_QMODE_VC7. Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no Posted
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: P TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: P TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_PQ_HCRD_VC7. Specifies the number of
Posted Hdr Credits to advertise.

Parameter Name : RADM_PQ_DCRD_VC7. Specifies the number of
Posted Data Credits to advertise. One data credit = 128 bits of data.


Posted Advertised Credits: 


Parameter | Description | Disabled

Mode  Parameter Name : RADM_NP_QMODE_VC7. Non-Posted TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no NP
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: NP TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: NP TLPs are stored into the queue and
presented to the application at the same time as they are being stored.

Parameter Name : RADM_NPQ_HCRD_VC7. Specifies the number of
Non-Posted Hdr Credits to advertise.

Data  Parameter Name : RADM_NPQ_DCRD_VC7. Specifies the number of
Non-Posted Data Credits to advertise. One data credit = 128 bits of data.


Non-Posted Advertised Credits: 


Disabled 




Default? | Disabled


Mode  Parameter Name : RADM_CPL_QMODE_VC7. Completion TLP queue type. Three queue types are
available: Bypass, Store-Forward, and Cut-Through. Bypass: there is no CPL
receive queue in this mode, so the application must be able to accept all
traffic, because back-pressure is disabled in this mode. Store-Forward: CPL TLPs
are stored into the queue, and a TLP is advertised as available only after the
entire TLP has been stored. Cut-Through: CPL TLPs are stored into the queue and
presented to the application at the same time as they are being stored. | Y

Parameter Name : RADM_CPLQ_HCRD_VC7. Specifies the number of
Completion Hdr Credits to advertise. | Y

Data  Parameter Name : RADM_CPLQ_DCRD_VC7. Specifies the number of
Completion Data Credits to advertise. One data credit = 128 bits of data. | Y


Completion Advertised Credits: 


Disabled

Receive Arbitration Between Types  Parameter Name : CX_RADM_ORDERING_RULES_VC7.
Arbitration between transaction types (P/NP/CPL). If set to strict
priority, P is higher than CPL, which is higher than NP. Otherwise, it is set
to follow the PCIe spec, Table 2-23 ordering rules.

Decouple Depth from Credit  Parameter Name : RADM_DEPTH_DECOUPLE_VC7. Selecting
this option allows RAM depths to be specified independently of
the advertised credits.


Additional VC 7 Options: 


Disabled 


Hdr  Parameter Name : RADM_PQ_HDP_VC7. Specifies the depth of the
Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_PQ_DDP_VC7. Specifies the depth of the
Posted Data Queue/RAM. | Y Y


Posted Buffer Depth:


Disabled

Hdr  Parameter Name : RADM_NPQ_HDP_VC7. Specifies the depth of
the Non-Posted Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_NPQ_DDP_VC7. Specifies the depth of
the Non-Posted Data Queue/RAM. | Y Y


Non-Posted Buffer Depth:


Disabled

Hdr  Parameter Name : RADM_CPLQ_HDP_VC7. Specifies the depth of
the Completion Hdr Queue/RAM. | Y Y

Data  Parameter Name : RADM_CPLQ_DDP_VC7. Specifies the depth of
the Completion Data Queue/RAM. | Y


Completion Buffer Depth:


AXI Configuration 


Disabled 
AXI Enable 10 | No description available. Y 


Master Interface Options 


Parameter | Disabled

Master Interface Enable  Parameter Name : MASTER_POPULATED. Indicates that a
master interface is required. | Y

Enable Independent AXI Master Clock  No description available.

Master Decomposer Enable  Parameter Name : RADMX_DECOMPOSER_POPULATED.
Indicates that the master interface requires a decomposer.

Maximum Master Tags Supported  Parameter Name : CC_MAX_MSTR_TAG. Specifies the maximum
number of tags supported by the AXI Master.

Remote Device MAX Read Request Size  Parameter Name : CX_REMOTE_RD_REQ_SIZE. Specifies the
maximum read request size supported by the PCIe core receiver
when an AXI or AHB master is populated. This parameter is
used to size AXI/AHB master composer memories.

AXI Master Address Width  Parameter Name : CC_MSTR_BUS_ADDR_WIDTH. Specifies the
master address width on AXI.

AXI Master Data Width  Parameter Name : CC_MSTR_BUS_DATA_WIDTH. Specifies the
master data width on AXI.

Master Page Boundary Size  Parameter Name : CC_MSTR_PAGE_BOUNDARY_PW. Specifies
the page boundary size supported by the AXI Master. No packet can
have an address range that crosses this boundary. Packets are split to
conform to this requirement.
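The page-boundary rule above can be sketched as follows: a transfer whose address range would cross a boundary is cut into pieces that each stay within one page. This is a minimal illustration, not the bridge's actual logic. The function name is hypothetical, and the page size is passed directly in bytes and assumed to be a power of two; how CC_MSTR_PAGE_BOUNDARY_PW encodes the size is not stated in the text.

```python
def split_at_page_boundary(addr, length, page_size):
    """Split (addr, length) into pieces that never cross a page boundary.

    Returns a list of (start_address, byte_count) tuples in address order.
    """
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page_size must be a positive power of two")
    if addr < 0 or length < 0:
        raise ValueError("addr and length must be non-negative")
    pieces = []
    while length > 0:
        room = page_size - (addr % page_size)  # bytes left in the current page
        chunk = min(length, room)
        pieces.append((addr, chunk))
        addr += chunk
        length -= chunk
    return pieces
```

With a 4 KB boundary, a 32-byte transfer starting at 0xFF0 is split into two 16-byte pieces, one ending at the boundary and one starting at 0x1000.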


Default? | Disabled

Master Response’s HEADER FIFO Queue Depth  4  Parameter Name : CC_XADMX_CLIENT0_QUEUE_HDP.
Indicates the bridge’s master response HEADER FIFO queue size.

Master Response’s DATA FIFO Queue Depth  Parameter Name : CC_XADMX_CLIENT0_QUEUE_DDP.
Indicates the bridge’s master response DATA FIFO queue size.

Master Request’s HEADER FIFO Queue Depth  Parameter Name : CC_RADMX_DECOMPOSER_HDRQ_DP.
Indicates the bridge’s master request HEADER FIFO queue size.

Master Request’s DATA FIFO Queue Depth  Parameter Name : CC_RADMX_DECOMPOSER_DATAQ_DP.
Indicates the bridge’s master request DATA FIFO queue size.


Master Queue Options: 


Slave Interface Options 


Default? | Disabled

Slave Interface Enable  Parameter Name : SLAVE_POPULATED. Indicates that a slave
interface is required. | Y

Enable Independent AXI Slave Clock  No description available. | Y

Slave Composer Enable  Parameter Name : RADMX_COMPOSER_POPULATED. Indicates
that the slave interface requires a composer. | Y

Maximum Slave Tags Supported  Parameter Name : CC_MAX_SLV_TAG. Specifies the maximum
number of tags supported by the AXI Slave. | Y

AXI Slave Address Width  Parameter Name : CC_SLV_BUS_ADDR_WIDTH. Specifies the slave
address width on AXI. | Y

32
32
32

AXI Slave Data Width  Parameter Name : CC_SLV_BUS_DATA_WIDTH. Specifies the slave
data width on AXI. | Y

AXI Slave ID Width  Parameter Name : CC_SLV_BUS_ID_WIDTH. Specifies the slave ID
width on AXI. | Y

Enable in order SLAVE  Parameter Name : SLAVE_IN_ORDER_EN. Indicates that slave
services of the AXI logic will ensure that responses are returned in order.


Disabled 


Slave Request’s Parameter Name : CC_XADMX_CLIENT1_QUEUE_HDP Indicates | Y Y 
HEADER FIFO that bridge’s slave request HEADER FIFO queue size 
Queue Depth 

Y 


Slave Request’s 16 Parameter Name : CC_XADMX_CLIENT1_QUEUE_DDP Indicates 
DATA FIFO that bridge’s slave request DATA FIFO queue size 
Queue Depth 


Slave Queue Options: 


DBI Slave Interface Options 


Disabled 


Slave DBI Enable Parameter Name : DBIL4SLAVE_POPULATED Indicates that slave | Y Y 
interface requires DBI 
Y 


Enable No description available. 


Independent AXI 
DBI Slave Clock 


AXI DBI Slave 32 Parameter Name : CC_DBLSLV_BUS_ADDR_WIDTH Specify the | Y Y 
Address Width slave address width on AXI. 


13.16 PCS, PHY Layers 

Please reference the Synopsys’ “PCI-Express 90nm PHY Data Book”. This provides a description of the pins, 
the timing requirements, and the programmer-visible registers. 

13.17 Power Management

The PCI-Express subsystem is active in only a fraction of the ICE9 chips on a processing module. To minimize
power consumption, the PCI-Express subsystem must be capable of complete power-down when not in use. Support
of intermediate power states is not required.
Chapter 14


I2C Interface 


[Last Modified $Id: chipi2c.lyx 50693 2008-02-07 16:01:46Z wsnyder $] 


14.1 Overview 


The chip implements an I2C Master Controller in order to read the Serial Presence Detect (SPD) configuration
of its local DIMMs using the industry standard I2C Bus.¹ This chapter provides a brief description of the I2C
Master Controller, the registers provided to program it, and the actions necessary to initialize and operate it.


14.2 Description 


The ICE9 implementation uses the OpenCores (www.opencores.org) I2C Master Controller. The I2C core
will be contained in the BBS unit with the other programmed I/O devices. The core need only generate 7-bit
I2C addresses and will be operated at a frequency of 100 kHz.² In our implementation the I2C core will be the
sole I2C Bus master and should never have to arbitrate for bus mastership even though the core supports it. Our
implementation does NOT support interrupts and all mention of interrupts in the OpenCores documentation should
be ignored. See section 14.7 for descriptions of how to poll the I2C core to determine when it is no longer busy.
The core specification and programmer’s guide from OpenCores can be found on the WIKI at:


http://apollo.sicortex.com/swiki/I2cInterface 
For a complete description of the I2C Bus Architecture see the Philips Semiconductors I2C Bus Specification at: 


file:///net/sicortex/system/standards/PHILIPS_I2C_spec.pdf 


14.3 Package Attributes


Package 


chip_i2c_spec 


14.4 Registers and Definitions 


All registers in the I2C Core can be considered 8 bits wide. Although the Clock Prescale Register is internally
16 bits wide, it is read and written in two 8-bit halves and can therefore be considered as two 8-bit registers.
All registers described here are implemented as per the specification on the WIKI. The addressing, however, is
somewhat different. Each address is relative to the I2C Interface’s base address. Register 0 starts at I2C_BASE +
0, register 1 starts at I2C_BASE + 8, and so on. That is, the registers appear in the address space to be 8 bytes apart
even though only one byte is being transferred. For transfers within the I2C address space, the byte transferred
is always the least significant byte of a little-endian 32-bit longword. Please note that all reserved bits are read as
zeros. To ensure forward compatibility, they should be written as zeros.

¹ Also known as the Inter-Integrated Circuit Bus or I²C Bus. Throughout this document it is simply referred to as the I2C Bus.

² Since the I2C Bus is usually transferring 1 bit of serial data on its SDA line per clock, the SCL frequency is sometimes also described
in terms of a bit rate, in bits per second, scaled appropriately as either kilobits per second (kbps) or megabits per second (Mbps). Thus
100 kHz = 100 kbps.
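
The 8-byte register spacing can be illustrated with a short sketch. The base address is taken from the R_I2cPrerLo entry in this chapter; the helper names are hypothetical and only show the arithmetic, not any SiCortex software interface.

```python
# Base address of the I2C register block, taken from the R_I2cPrerLo entry.
I2C_BASE = 0xE_A800_0000


def i2c_reg_addr(reg_index, base=I2C_BASE):
    """Address of core register `reg_index`; registers sit 8 bytes apart."""
    if reg_index < 0:
        raise ValueError("register index must be non-negative")
    return base + 8 * reg_index


def register_byte(longword):
    """The byte that is actually transferred: the least significant one."""
    return longword & 0xFF
```

With this mapping, register 4 (Command and Status) lands at 0xE_A800_0020, which matches the address listed for R_I2cCmdSts.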


14.4.1 I2C Clock Prescale Register 
Description 


This register is used to prescale the I2C’s SCL clock line. The prescale register is 16 bits wide but must be 
written as two 8 bit halves, with each half at its own unique address as shown below. Due to the structure of the 
I2C interface, the core uses a 5*SCL clock internally. The prescale register must be programmed to this 5*SCL 
frequency minus 1. You may change the value of the prescale register only when the EN bit in the control register 
is cleared (disabled). 

In this implementation, the I2C core derives its SCL clock from the L2 Cache clock (CCLK). With a 16 bit 
prescale register, this implies that the SCL clock can run at any frequency from ~763 Hz to 50 MHz. However 
because I2C is an industry standard implemented by many different vendors using various processes, the I2C 
specification establishes standard maximum I2C clock frequencies of 100 kHz (normal), 400 kHz (fast) and 3.4 
MHz (high-speed). In order to support the broadest range of devices available, this implementation should operate 
at the lowest standard maximum clock frequency of 100 kHz. Therefore the value for the prescale register should 
be chosen such that the operating CCLK frequency is divided down to 100 kHz. 

The formula for calculating the prescale value is: 


\[ \mathrm{prescale} = \frac{\mathrm{cclk}}{5 \times \mathrm{scl}} - 1 \]


Substituting our known frequency values for cclk and scl yields: 


\[ \mathrm{prescale} = \frac{250{,}000{,}000}{5 \times 100{,}000} - 1 = 499 = \mathrm{1F3}\ \mathrm{(hex)} \]
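
The same calculation can be sketched in a few lines of Python. The function names are illustrative only; the arithmetic follows the formula above, rounding to the nearest integer divider.

```python
def i2c_prescale(cclk_hz, scl_hz):
    """Prescale value for a target SCL frequency: cclk / (5 * scl) - 1."""
    divider = 5 * scl_hz
    value = (cclk_hz + divider // 2) // divider - 1   # round to nearest
    if not 0 <= value <= 0xFFFF:
        raise ValueError("prescale does not fit in 16 bits")
    return value


def split_prescale(value):
    """Return (low byte, high byte) for the two 8-bit prescale registers."""
    return value & 0xFF, (value >> 8) & 0xFF


def scl_from_prescale(cclk_hz, prescale):
    """SCL frequency in Hz that a given prescale value produces."""
    return cclk_hz / (5 * (prescale + 1))


prescale = i2c_prescale(250_000_000, 100_000)
lo, hi = split_prescale(prescale)
```

For a 250 MHz CCLK this gives a low byte of 0xF3 and a high byte of 0x01. Evaluating scl_from_prescale at the two extremes (0xFFFF and 0) reproduces the ~763 Hz to 50 MHz range quoted above.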


The two halves used to read and write the prescale register are as follows: 


Register 
R_I2cPrerLo 


Address 
0xE_A800_0000 


Definitions 


31:8 Reserved


7:0 prerlo RW 0xFF Low byte of I2C clock prescale register.
Change only when EN bit of I2C Control Register is ’0’.


Register 
R_I2cPrerHi 


Address 
0xE_A800_0008 


Definitions 
31:8 Reserved


7:0 prerhi RW 0xFF High byte of I2C clock prescale register.
Change only when EN bit of I2C Control Register is ’0’.


14.4.2 I2C Control Register 


Description 


The Control Register enables I2C operation. The core responds to new commands only when the EN bit is set 
and after pending commands are finished. Clear the EN bit only when no transfer is in progress, i.e. after a STOP 
command, or when the command register has the STO bit set. If halted during a transfer, the core can hang the 
I2C Bus. 

Register 


R_I2cCtl 


Address 
0xE_A800_0010 


Definitions 


31:8 Reserved

7 en RWS 0 Enable I2C unit. When 1, the I2C widget is enabled.

6:0 Reserved


14.4.3 I2C Data Register 
Description 


On a write, contains the next byte to send onto the I2C Bus from the master core. The byte can be either data or
the 7-bit I2C slave address along with the read/write command. On a read, contains the last byte received from
the I2C Bus.


Register 
R_I2cData 


Address 
0xE_A800_0018 


Definitions 


7:0 rxData R X Last byte received from the I2C bus.

7:0 txData WS Next byte to transmit on the I2C bus.

7:1 txAddr W For slave address transfers these bits represent the 7-bit
I2C address.
Overlaps allowed.

0 txRW W For slave address transfers this bit represents the I2C
R/W bit.
’1’ = reading from slave
’0’ = writing to slave
Overlaps allowed.
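
As a small illustration of the txAddr/txRW layout, the byte written to the Data Register for a slave address transfer can be composed as follows. The function name is hypothetical.

```python
def i2c_address_byte(addr7, read):
    """Pack a 7-bit slave address (bits 7:1) with the R/W bit (bit 0)."""
    if not 0 <= addr7 <= 0x7F:
        raise ValueError("I2C address must fit in 7 bits")
    return (addr7 << 1) | (1 if read else 0)
```

For the slave at address 0x51 used in the transfer examples, this yields 0xA2 for a write and 0xA3 for a read.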


14.4.4 I2C Command and Status Register 
Description 


Controls the operation of the I2C Master core on write and reports its status on read. See the core specification 
on the WIKI and the transfer sequences described in this document for a more detailed description on how to use 
the bits in this register. Note that the STA, STO, RD, and WR bits are cleared automatically. These bits are 
always read as zeros. 


Register 
R_I2cCmdSts 


Attributes 


-writeonemixed 


Address 
0xE_A800_0020 


Definitions 


31:8 Reserved


7 sta Generate start or repeated-start condition.

6 sto Generate stop condition.

5 rd Read data from slave.

4 wr Write data to slave.

3 ack W1C When acting as a receiver, send ACK (ACK=’0’) or
NACK (ACK=’1’).
Overlaps allowed


2:0 Reserved. Write as zero.
Overlaps allowed


7 rxAck R Received acknowledge from slave.
This flag represents acknowledge from the addressed slave.
’1’ = No acknowledge received
’0’ = Acknowledge received
Overlaps allowed

6 busy R I2C bus busy.
Use this flag to determine when a forced stop operation
is complete. A forced stop occurs when only the STO
bit in the command register is set. A return value of ’0’
indicates the operation has completed.
’1’ after START signal detected.
’0’ after STOP signal detected.
Overlaps allowed


5:2 R Reserved
Overlaps allowed


1 tip R Transfer in progress.
Use this flag to determine when a transfer is complete after
either the RD or WR bit has been set in the Command
Register.
’1’ when transferring data
’0’ when transfer is complete
Overlaps allowed


0 R Reserved
Overlaps allowed
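
The command and status bits can be captured as masks, as in the sketch below. Only the positions of rd (bit 5), tip (bit 1) and the reserved range 5:2 are legible in the table above; the remaining positions are an assumption based on the standard OpenCores I2C master layout, and the class names are illustrative.

```python
from enum import IntFlag


class Cmd(IntFlag):
    """Command bits (write side); positions other than RD are assumed."""
    STA = 0x80   # generate start or repeated start
    STO = 0x40   # generate stop
    RD = 0x20    # read a byte from the slave
    WR = 0x10    # write a byte to the slave
    ACK = 0x08   # as receiver: 0 sends ACK, 1 sends NACK


class Sts(IntFlag):
    """Status bits (read side); positions other than TIP are assumed."""
    RXACK = 0x80  # 1 = no acknowledge received from the slave
    BUSY = 0x40   # 1 after START detected, 0 after STOP detected
    TIP = 0x02    # 1 while a transfer is in progress


def decode_status(raw):
    """Break a raw status byte into named flags."""
    return {
        "rx_nack": bool(raw & Sts.RXACK),
        "busy": bool(raw & Sts.BUSY),
        "tip": bool(raw & Sts.TIP),
    }
```

Under these assumptions, the "STA and WR" command used to send a slave address is the value 0x90, and the final command of a single byte read (RD, NACK and STO together) is 0x68.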


14.4.5 I2C Core Reset Register 
Description 


Provides a software controllable reset to the I2C core. This register is not actually part of the I2C Core logic. It
is implemented in the CSI widget of the PMI and is used to drive the synchronous software-based reset to the I2C
core. A write of any value to this register will assert the synchronous reset to the I2C Core for one CCLK cycle.


Register 
R_I2cReset 


Address 
0xE_A800_0028 


Definitions 


31:0 reset WS I2C Core Reset 
A write of any value will reset the I2C core. 


14.5 Reset 


The I2C Core can be reset under both hardware and software control. The hardware reset is provided at power- 
on and under Module Service Processor control via the I2C Reset Control Bit in the Reset Control Register portion 
of the SysChain implemented in the LBS. The hardware reset asserts asynchronously and releases synchronous to 
CCLK. The ARST_LVL core parameter described in the OpenCores spec is left unchanged so that the core supports
an active low asynchronous hardware reset. The software reset is provided by the R_I2cReset register. Writing any 
value to this register will reset the I2C Core synchronous to CCLK by asserting reset for one CCLK cycle. 
14.6 Initialization


For the ICE9 implementation the I2C Core exits reset synchronous to CCLK. During reset the following actions 
occur: 


• The Prescale Register is set to 0xFFFF, the slowest I2C clock speed available.

• The EN bit in the Control Register is cleared, disabling the core.

• The Transmit and Receive Data Registers are both cleared.

• All bits in the Command Register are cleared.

• The I2C Master Controller is placed into the idle state.

• The I2C bus drivers are disabled, allowing the SCL and SDA wires to rise to a logic level of ’1’.


After reset, software should perform the following operations in the order listed to prepare the core for normal 
operation: 


1. Set the Prescale Registers to the correct value for a 100 kHz I2C SCL frequency. You may write the halves in
any order, but it is probably easiest to write the MSB first and the LSB last.


2. Set the EN bit in the Control Register. 
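
The two initialization steps can be sketched as follows. This is a minimal illustration, not SiCortex software: `write_reg` is a hypothetical callback that stores one byte at a register address, the addresses come from the register entries in section 14.4, and the EN bit is assumed to be bit 7 of the Control Register.

```python
I2C_BASE = 0xE_A800_0000
R_I2C_PRER_LO = I2C_BASE + 0x00
R_I2C_PRER_HI = I2C_BASE + 0x08
R_I2C_CTL = I2C_BASE + 0x10
CTL_EN = 0x80   # assumed position of the EN bit (bit 7)


def i2c_init(write_reg, cclk_hz=250_000_000, scl_hz=100_000):
    """Program the prescaler (MSB first, LSB last), then enable the core."""
    prescale = cclk_hz // (5 * scl_hz) - 1
    write_reg(R_I2C_PRER_HI, (prescale >> 8) & 0xFF)
    write_reg(R_I2C_PRER_LO, prescale & 0xFF)
    write_reg(R_I2C_CTL, CTL_EN)


# Record the register writes instead of touching hardware.
log = []
i2c_init(lambda addr, value: log.append((addr, value)))
```

The prescaler is written before EN is set because the prescale value may only change while the core is disabled, which is the state it is in after reset.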


14.7 Transfer Sequences 


14.7.1 Example 1: Byte Writes 


Write to a slave memory device at I2C address 0x51, 1 byte of data (0xAC) to location 128 (0x80). To write
multiple bytes, simply repeat commands 9 to 12 below, but DO NOT set the STO bit in the Command Register
until sending the last byte. Note: Typically a slave memory device will wrap back to its first location when writing
past the last location of the device. Extra caution should be observed when writing to a DIMM SPD
Serial-EEPROM because of this behavior. Also, SPD devices typically support multi-byte writes
only up to a block size of 16 bytes. They may wrap around to the start address after 16 bytes.³


I2C-Sequence: 
1. Generate a START command.
2. Send the slave device address + the write bit.
3. Wait for an acknowledge from the slave.
4. Write the address to be written.
5. Wait for an acknowledge from the slave.
6. Write the data to be written.
7. Wait for an acknowledge from the slave.
8. Generate a STOP command.


Commands: 


1. Write 0xA2 (address 0x51 left shifted 1 bit to accommodate r/w bit + write bit of ’0’) to the Transmit Data
Register.
2. Set the STA and WR bits in the Command Register.
3. Poll TIP flag in the Status Register until it is negated.
4. Read RxACK bit from the Status Register, should be ’0’.
5. Write 0x80 (address to be written, location 128 decimal) to the Transmit Data Register.
6. Set WR bit in the Command Register.
7. Poll TIP flag in the Status Register until it is negated.
8. Read RxACK bit from the Status Register, should be ’0’.
9. Write 0xAC (the data to be written) to the Transmit Data Register.
10. Set STO and WR bits in the Command Register.
11. Poll TIP flag in the Status Register until it is negated.
12. Read RxACK bit from the Status Register, should be ’0’.

³ Some Serial-EEPROM devices offer an I2C programmable write-protect feature. This feature prevents the writing of any data
into the device without first writing a special data pattern to a specific location to unlock the device. Writing a different special data
pattern or a different specific location will re-lock the device when finished.
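
The twelve commands of this example can be sketched as a small driver routine. Everything below is illustrative: `MockEepromBus` is a hypothetical toy model standing in for the core plus one EEPROM so the sequence can run without hardware, and the bit positions are assumed to follow the OpenCores I2C master layout.

```python
TXR, CMD_STS = 0x18, 0x20            # register offsets from I2C_BASE
STA, STO, WR = 0x80, 0x40, 0x10      # command bits (assumed positions)
RXACK, BUSY, TIP = 0x80, 0x40, 0x02  # status bits (assumed positions)


class MockEepromBus:
    """Hypothetical model: the I2C core plus one EEPROM slave at 0x51."""

    def __init__(self, slave_addr=0x51):
        self.slave_addr = slave_addr
        self.mem = bytearray(256)
        self.ptr = None        # EEPROM location pointer, None until set
        self.selected = False
        self.txr = 0
        self.status = 0
        self.tip_polls = 0     # status reads that still report TIP=1

    def write_reg(self, offset, value):
        if offset == TXR:
            self.txr = value
        elif offset == CMD_STS and value & WR:
            self.tip_polls = 2
            if value & STA:                      # address byte
                self.selected = (self.txr >> 1) == self.slave_addr
                self.ptr = None
                self.status |= BUSY
            elif self.selected and self.ptr is None:
                self.ptr = self.txr              # first byte: location
            elif self.selected:
                self.mem[self.ptr] = self.txr    # later bytes: data
                self.ptr = (self.ptr + 1) % len(self.mem)
            nack = 0 if self.selected else RXACK
            self.status = (self.status & ~RXACK) | nack
            if value & STO:
                self.status &= ~BUSY
                self.selected = False

    def read_reg(self, offset):
        if offset != CMD_STS:
            return 0
        tip = TIP if self.tip_polls else 0
        self.tip_polls = max(0, self.tip_polls - 1)
        return self.status | tip


def wait_tip(dev, limit=1000):
    """Poll the status register until TIP is negated; return the status."""
    for _ in range(limit):
        status = dev.read_reg(CMD_STS)
        if not status & TIP:
            return status
    raise TimeoutError("TIP never cleared")


def send_byte(dev, byte, command):
    """Load the Transmit Data Register, issue the command, check RxACK."""
    dev.write_reg(TXR, byte)
    dev.write_reg(CMD_STS, command)
    if wait_tip(dev) & RXACK:
        raise IOError("slave did not acknowledge")


def eeprom_write_byte(dev, i2c_addr, location, data):
    send_byte(dev, (i2c_addr << 1) | 0, STA | WR)   # commands 1 to 4
    send_byte(dev, location, WR)                    # commands 5 to 8
    send_byte(dev, data, STO | WR)                  # commands 9 to 12


dev = MockEepromBus()
eeprom_write_byte(dev, 0x51, 0x80, 0xAC)
```

A multi-byte write would repeat the last `send_byte` call with `WR` alone for every byte except the final one, which carries `STO | WR`.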


14.7.2 Example 2: Byte Reads 


Read from a slave memory device at I2C address 0x51, one byte of data at location 128 (0x80). To read multiple
bytes, simply repeat commands 13 to 15 below for each byte to be read, but DO NOT set the ACK and STO bits
in the Command Register until reading the last byte. Note: Typically a slave memory device will wrap back to its
first location when reading past the last location of the device.


I2C-Sequence: 


1. Generate a START command.
2. Write the slave address + write bit.
3. Receive acknowledge from the slave.
4. Write the memory address to the slave.
5. Receive acknowledge from the slave.
6. Generate a repeated START command.
7. Write the slave address + read bit.
8. Receive acknowledge from the slave.
9. Read a byte from the slave.
10. Write no acknowledge (NACK) to slave, indicating end of transfer.
11. Generate stop signal.


Commands: 


1. Write 0xA2 (address 0x51 left shifted 1 bit to accommodate r/w bit + write bit of ’0’) to the Transmit Data
Register.
2. Set the STA and WR bits in the Command Register.
3. Poll TIP flag in the Status Register until it is negated.
4. Read RxACK bit from the Status Register, should be ’0’.
5. Write 0x80 (the memory location to be read) to the Transmit Data Register.
6. Set the WR bit in the Command Register.
7. Poll TIP flag in the Status Register until it is negated.
8. Read RxACK bit from the Status Register, should be ’0’.
9. Write 0xA3 (address 0x51 left shifted 1 bit to accommodate r/w bit + read bit of ’1’) to the Transmit Data
Register.
10. Set the STA and WR bits in the Command Register.
11. Poll TIP flag in the Status Register until it is negated.
12. Read RxACK bit from the Status Register, should be ’0’.
13. Set the RD bit, the ACK bit to ’1’ (NACK), and the STO bit in the Command Register.
14. Poll TIP flag in the Status Register until it is negated.
15. Read the byte in the Receive Data Register that was transferred over I2C from the slave memory.
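
The read sequence can be sketched the same way. As before, `MockEepromBus` is a hypothetical toy model used only to make the example runnable (it completes every transfer instantly, so TIP always reads 0), and the bit positions are assumed to follow the OpenCores I2C master layout.

```python
TXR = RXR = 0x18                         # transmit/receive share one address
CMD_STS = 0x20
STA, STO, RD, WR, ACK = 0x80, 0x40, 0x20, 0x10, 0x08   # assumed positions
RXACK, BUSY, TIP = 0x80, 0x40, 0x02                    # assumed positions


class MockEepromBus:
    """Hypothetical model: the I2C core plus one EEPROM slave at 0x51."""

    def __init__(self, slave_addr=0x51):
        self.slave_addr = slave_addr
        self.mem = bytearray(256)
        self.ptr = 0
        self.txr = self.rxr = self.status = 0
        self.selected = False
        self.expect_location = False

    def write_reg(self, offset, value):
        if offset == TXR:
            self.txr = value
            return
        if offset != CMD_STS:
            return
        if value & WR:
            if value & STA:                       # address byte
                self.selected = (self.txr >> 1) == self.slave_addr
                self.expect_location = self.selected and not self.txr & 1
                self.status |= BUSY
            elif self.expect_location:            # memory location byte
                self.ptr, self.expect_location = self.txr, False
            nack = 0 if self.selected else RXACK
            self.status = (self.status & ~RXACK) | nack
        elif value & RD and self.selected:
            self.rxr = self.mem[self.ptr]
            self.ptr = (self.ptr + 1) % len(self.mem)
        if value & STO:
            self.status &= ~BUSY
            self.selected = False

    def read_reg(self, offset):
        return self.rxr if offset == RXR else self.status


def wait_tip(dev, limit=1000):
    """Poll the status register until TIP is negated; return the status."""
    for _ in range(limit):
        status = dev.read_reg(CMD_STS)
        if not status & TIP:
            return status
    raise TimeoutError("TIP never cleared")


def eeprom_read_byte(dev, i2c_addr, location):
    steps = (
        ((i2c_addr << 1) | 0, STA | WR),   # commands 1 to 4
        (location, WR),                    # commands 5 to 8
        ((i2c_addr << 1) | 1, STA | WR),   # commands 9 to 12
    )
    for byte, command in steps:
        dev.write_reg(TXR, byte)
        dev.write_reg(CMD_STS, command)
        if wait_tip(dev) & RXACK:
            raise IOError("slave did not acknowledge")
    dev.write_reg(CMD_STS, RD | ACK | STO)   # command 13: read, NACK, stop
    wait_tip(dev)                            # command 14
    return dev.read_reg(RXR)                 # command 15


dev = MockEepromBus()
dev.mem[0x80] = 0xAC
value = eeprom_read_byte(dev, 0x51, 0x80)
```

To read several bytes, the final three steps would be repeated with `RD` alone for each byte, adding `ACK | STO` only on the last one.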


14.7.3 Example 3: Unacknowledged Transfer


In this example, no slave acknowledges the address and the master must free the I2C bus with a stop. Assume 
that the intended slave at I2C address 0x10 fails to acknowledge its address. In this case it is necessary to generate 
a stop independent of a read or write transaction. To determine when the issued stop operation has completed, 
it is necessary to poll the BUSY bit in the Status Register in place of the TIP bit. The TIP bit does not change 
when only a STOP has been issued from the Command Register. 


I2C-Sequence: 


1. Generate a START command. 
2. Send a write to an unused slave address. 
3. Receive a no-acknowledge. 


4. Abort the operation by generating a stop signal. 
Commands: 


1. Write 0x20 (address 0x10 left shifted 1 bit to accommodate r/w bit + write bit of ’0’) to the Transmit Data Register.
2. Set the STA and WR bits in the Command Register. 

3. Poll TIP flag in the Status Register until it is negated. 

4. Read RxACK bit from the Status Register, should be ’0’ but we obtain a ’1’ (no ack). 

5. Set the STO bit in the Command Register to force a stop. 


6. Poll the BUSY flag in the Status Register until it is set to ’0’. 


It should be noted that unacknowledged transfers can also occur on data transfers between master and slave, not 
just on an address as in this example. In either case, the master must abort the operation and free the I2C bus by 
issuing a stop. In general, when commanding only a stop condition, the BUSY bit should be polled in place of the 
TIP bit to determine when the master has completed the operation. 
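
The abort path can be sketched as below. `MockEmptyBus` is a hypothetical model of a bus on which no slave answers, and the bit positions are assumed to follow the OpenCores I2C master layout. The point of the example is the choice of flag: TIP after a WR command, BUSY after a stop-only command.

```python
TXR, CMD_STS = 0x18, 0x20
STA, STO, WR = 0x80, 0x40, 0x10          # command bits (assumed positions)
RXACK, BUSY, TIP = 0x80, 0x40, 0x02      # status bits (assumed positions)


class MockEmptyBus:
    """Hypothetical model of an I2C bus with no responding slave."""

    def __init__(self):
        self.status = 0
        self.stop_polls = 0    # status reads left until the stop completes

    def write_reg(self, offset, value):
        if offset != CMD_STS:
            return
        if value & STA:
            self.status |= BUSY
        if value & WR:
            self.status |= RXACK           # nobody acknowledges
        if value & STO:
            self.stop_polls = 3

    def read_reg(self, offset):
        if self.stop_polls:
            self.stop_polls -= 1
            if self.stop_polls == 0:
                self.status &= ~BUSY
        return self.status


def poll_until_clear(dev, mask, limit=1000):
    """Poll the status register until every bit in `mask` reads 0."""
    for _ in range(limit):
        status = dev.read_reg(CMD_STS)
        if not status & mask:
            return status
    raise TimeoutError("status bit never cleared")


def probe(dev, addr7):
    """Address a slave; on NACK, force a stop. True if it acknowledged."""
    dev.write_reg(TXR, addr7 << 1)               # command 1
    dev.write_reg(CMD_STS, STA | WR)             # command 2
    status = poll_until_clear(dev, TIP)          # command 3
    if status & RXACK:                           # command 4
        dev.write_reg(CMD_STS, STO)              # command 5: stop only
        poll_until_clear(dev, BUSY)              # command 6: poll BUSY
        return False
    return True   # caller still owns the bus and must finish the transfer


dev = MockEmptyBus()
found = probe(dev, 0x10)
```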


14.8 External Connections 


The I2C interface uses a bi-directional serial data line (SDA) and a bi-directional serial clock line (SCL) for
data transfers. All devices connected to these two signals must have open drain or open collector outputs. Both
lines must be pulled up to Vdd or Vcc by external resistors.

In the ICE9 implementation, the I2C core assumes open drain tri-state buffers for SDA and SCL will be added
at a higher hierarchical level. Internally it uses two uni-directional signals and an output enable for each of SDA
and SCL. Connections between the core and pins should be made according to the following figure:
Figure 14.1: External Connections

[Figure: connections between the core pad signals (scl_pad_i, sda_pad_o, sda_pad_oe_o) and the SCL and SDA pins.]
Chapter 15


UART 


[Last Modified $Id: chipuart.lyx 50693 2008-02-07 16:01:46Z wsnyder $] 


15.1 Overview 


The chip implements a standard UART to support kernel debugging from a serial console line. This chapter 
provides a brief description of the UART, the registers provided to program the device and the actions necessary 
to initialize and operate the device. 


15.2 Differences, Bugs, and Enhancements 


15.2.1 Product and Chip Pass Differences 


1. FIX NEED IMPL: TWC9A removes the UART flow control signals. They were never used on the ICE9 
modules. 


15.3 Description 


The ICE9 implementation uses the Open Cores (www.opencores.org) 16550 UART core. This core supports
the EIA RS232 serial line protocol and is Wishbone Bus compliant. For this application it has been modified to
operate strictly in 8-bit mode and does not support the special debug features that were in the original core.¹ It
is nearly identical in operation to the industry standard National Semiconductor 16550A with the main exceptions
being that only the FIFO mode is supported and the scratch register is not implemented. For a full description,
see the Open Cores specification on the WIKI at:


http://apollo.sicortex.com/swiki/UartInterface 


The UART core will be contained in the BBS unit with the other programmed I/O devices. The UART may 
interrupt any of the six processors on the ICE9 node. The UART TX/RX data signals and RTS/CTS hardware 
flow controls are brought out to pins on the chip that may be wired to a header on the board after level conversion 
as well as to an external multiplexer on the Module Service Processor. This allows for both local and remote serial 
console access to the chip. 


15.4 Package Attributes 


Package 
chip_uart_spec 


¹ Or were intended to be in the original core. The most recent version from Open Cores that was available to us when we started
had several bugs in this area. We finessed the problem by not implementing these unneeded features.
15.5 Registers and Definitions


All registers in the UART are 8 bits wide and are fully described in the UART spec on the WIKI (see above). 
All registers described here are implemented as per the specification on the WIKI. The addressing, however, is 
somewhat different. Each address is relative to the UART base address. Register 0 starts at UART_BASE + 0, 
register 1 starts at UART_BASE + 8, and so on. That is, the registers appear in the address space to be 8 bytes 
apart even though only one byte is being transferred. For transfers within the UART address space, the byte 
transferred is always the least significant byte of a little-endian 64-bit word. The UART_BASE is simply the first 
address used. Table 15.1 lists all of the registers implemented in the UART. 


Table 15.1: UART Register List 


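Name | Address | Width | Access | Description
Receiver Buffer | 0 | 8 | R | Receiver FIFO output.
Transmitter Holding Register (THR) | 0 | 8 | W | Transmit FIFO input.
Interrupt Enable | 1 | 8 | RW | Enable/Mask Interrupts generated by the UART.
Interrupt Identification | 2 | 8 | R | Get interrupt information.
FIFO Control | 2 | 8 | W | Select the FIFO trigger level and clear the FIFOs.
Line Control | 3 | 8 | RW | Format of the asynchronous data communication; Divisor Latch access.
Modem Control | 4 | 8 | W | Control signals to a modem.
Line Status | 5 | 8 | R | Operational line status.
Modem Status | 6 | 8 | R | Modem status (unused).
Divisor Latch Byte 1 (LSB) | 0 | 8 | RW | The LSB of the divisor latch.
Divisor Latch Byte 2 (MSB) | 1 | 8 | RW | The MSB of the divisor latch.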


15.5.1 Baud Rate Generation using the Clock Divisor Latch 


The Divisor Latch can be accessed by setting the 7th bit of LCR to ’1’. This bit should be set back to ’0’ after
setting the Divisor Latch in order to restore access to the other registers that occupy the same addresses. The two
bytes of the Divisor Latch form one 16-bit register, which is internally accessed as a single number. Therefore to
ensure normal operation, both bytes of the register should always be set. The Divisor Latch is set to the default
value of 0 on reset, which disables all serial I/O operations in order to ensure explicit setup of the register by
software. The value in the Divisor Latch is used to determine the baud rate of the serial I/O lines as a function
of the input clock. The value set should be equal to (system clock speed) / (16 x desired baud rate). The internal
counter starts to work when the LSB of the Divisor Latch is written, so when setting the Divisor Latch, write the
MSB first and the LSB last.
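
The access sequence described above can be sketched as follows. `read_reg` and `write_reg` are hypothetical callbacks for single-byte register access; the addresses are taken from the register entries in this chapter, and the starting LCR value of 0x03 in the demonstration is only an assumption.

```python
UART_BASE = 0xE_B800_0000
R_UART_DATA = UART_BASE + 0x00       # doubles as Divisor Latch LSB
R_UART_INTR_ENB = UART_BASE + 0x08   # doubles as Divisor Latch MSB
R_UART_LINE_CTRL = UART_BASE + 0x18
LCR_DLAB = 0x80                      # bit 7 of LCR: divisor latch access


def set_divisor(read_reg, write_reg, divl):
    """Open the divisor latch, write MSB then LSB, and close it again."""
    lcr = read_reg(R_UART_LINE_CTRL)
    write_reg(R_UART_LINE_CTRL, lcr | LCR_DLAB)
    write_reg(R_UART_INTR_ENB, (divl >> 8) & 0xFF)   # MSB first
    write_reg(R_UART_DATA, divl & 0xFF)              # LSB last starts counter
    write_reg(R_UART_LINE_CTRL, lcr & ~LCR_DLAB & 0xFF)


# Demonstration against a dictionary instead of hardware.
regs = {R_UART_LINE_CTRL: 0x03}      # assumed current LCR contents
log = []


def write_reg(addr, value):
    regs[addr] = value
    log.append((addr, value))


set_divisor(regs.__getitem__, write_reg, 1628)
```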


In this implementation the input clock is the Level 2 Cache Clock (CCLK). The formula for computing the 
contents of the Divisor Latch (DIVL) based on the baud rate is: 


\[ \mathrm{divl} = \frac{\mathrm{cclk}}{16 \times \mathrm{baudrate}} \]


Given a CCLK of 250MHz and a baud rate of 9600, the DIVL must be: 


\[ \mathrm{divl} = \frac{250{,}000{,}000}{16 \times 9600} = 1{,}627.604 \rightarrow 1{,}628 = \mathrm{65C}_{hex} \]
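
The divisor calculation can be sketched in Python as below. The function names are illustrative; the rounding is to the nearest integer with halves rounded up.

```python
def uart_divisor(cclk_hz, baud):
    """Divisor Latch value: cclk / (16 * baud), rounded to nearest."""
    step = 16 * baud
    return (cclk_hz + step // 2) // step


def divisor_bytes(divl):
    """Return (MSB, LSB) in the order they should be written."""
    return (divl >> 8) & 0xFF, divl & 0xFF


divl = uart_divisor(250_000_000, 9600)
msb, lsb = divisor_bytes(divl)
```

For a 250 MHz CCLK and 9600 baud this reproduces the value 1,628 (65C hex), split into an MSB of 0x06 and an LSB of 0x5C.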


Table 15.3 provides various DIVL settings for standard RS232 baud rates using CCLK values of 200, 225, 250 
and 275 MHz. Note: The hexadecimal values shown reflect the DIVL values rounded to the nearest integer value. 
Table 15.3: Divisor Latch Values for Common Baud Rates


Baud Rate | DIVL @ 200MHz CCLK | DIVL @ 225MHz CCLK | DIVL @ 250MHz CCLK | DIVL @ 275MHz CCLK
600 | 20,833.33 → 5161 hex | 23,437.5 → 5B8E hex | 26,041.67 → 65BA hex | 28,645.83 → 6FE6 hex
115,200 | 108.51 → 6D hex | 122.07 → 7A hex | 135.63 → 88 hex | 149.20 → 95 hex


Since the protocol is asynchronous and the sampling of the bits is conducted during the middle of the bit time, it 
is highly immune to small differences in the clocks of the sending and receiving sides. However, no such assumption 
should be made when calculating the Divisor Latch values; these should be as precise as possible. 

A word about the round-off errors for DIVL in the baud rate table above. The checked references indicate that
it is sufficient to maintain a baud rate clock to an accuracy of 3% (or better) of the bit time.² To account for
possible bit rate errors at both ends of the connection a 1% tolerance figure is used. For the worst case scenario
of 115,200 bps the ideal bit time is 8.681 µs. 1% of the ideal bit time is ±86.8 ns; therefore any error must fall
within this constraint. With a rounded DIVL setting of 109, the baud rate for a worst case CCLK of 200 MHz is
114,678.899 with a bit time of 8.720 µs. The error is 8.720 − 8.681 = 0.039 µs = 39 ns, which is well within the 1%
constraint.
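
The worst-case check above can be reproduced numerically. This is only a sketch of the arithmetic in the paragraph; the function name is illustrative.

```python
def baud_error(cclk_hz, target_baud, divl):
    """Return (actual baud rate, bit-time error in seconds) for a DIVL."""
    actual_baud = cclk_hz / (16 * divl)
    ideal_bit_time = 1 / target_baud
    actual_bit_time = 1 / actual_baud
    return actual_baud, actual_bit_time - ideal_bit_time


actual, error = baud_error(200_000_000, 115_200, 109)
tolerance = 0.01 / 115_200    # 1% of the ideal bit time, in seconds
within_limit = abs(error) <= tolerance
```

With these inputs the error comes out slightly above 39 ns against a limit of about 86.8 ns, so `within_limit` is true.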

In general, the faster the source clock, the less the susceptibility to bit rate errors due to divisor latch rounding.
Even though higher baud rates have less tolerance for bit rate errors, in this implementation even the fastest RS232 
baud rate is orders of magnitude slower than the source clock. 


15.5.2 RX/TX Data and Divisor Latch LSB 


Description 


When read, this register contains the output from the UART Receive FIFO. When written, this register loads
the input to the UART Transmit FIFO. When the 7th bit of the Line Control Register is set to ’1’, this register
contains the least significant byte of the 16-bit clock divisor latch.


Register 
R_UartData 


Attributes 


-noregtestcpu_reset -kernel 


Address 
0xE_B800_0000 

² Determining Clock Accuracy Requirements for UART Communication, DALLAS/Maxim Application Note AN2141 (see
the file at /net/sicortex/system/papers/UartClockAccuracy.pdf), the TIA/EIA-232-F Standard (http://global.ihs.com), and
http://www.seetron.com/ser_anl.htm, etc.

Definitions

31:8 Reserved


7:0 R Receiver Buffer. Output from the UART Receiver FIFO.
Overlaps allowed.


7:0 W Transmitter Holding Register. Input to the UART Transmit FIFO.
Overlaps allowed.

7:0 divl1 RWS Divisor Latch LSB. When LCR<7>=’1’ this field contains
the least significant byte of the 16-bit divisor latch.
Overlaps allowed.


15.5.3 Interrupt Enable Register (IER) and Divisor Latch MSB 


Description 


The IER enables the various interrupts provided by the UART. When the 7th bit of the Line Control Register
is set to ’1’ this register contains the most significant byte of the 16-bit clock divisor latch.


Register 
R_UartIntrEnb 


Attributes 


-kernel 


Address 
0xE_B800_0008 


Definitions 


31:8 Reserved


7:4 Reserved.

3 Enable Modem Status Interrupt.

2 rls Enable Receiver Line Status Interrupt.

1 thre Enable Transmitter Holding Register Empty Interrupt.

0 Enable Received Data Available Interrupt.


7:0 Divisor Latch MSB. When LCR<7>=’1’ this field contains
the most significant byte of the 16-bit divisor latch.
Overlaps allowed.


15.5.4 Interrupt Identification Register (IIR) and FIFO Control Register (FCR) 
Description 


The IIR enables the programmer to retrieve the current highest priority pending interrupt. Bit 0 indicates that 
an interrupt is pending when it’s logic ’0’. When it’s ’1’ no interrupt is pending. The FCR allows selection of the 
FIFO trigger level (the number of bytes in the FIFO required to enable the Received Data Available interrupt). 
In addition, the FIFOs can be cleared using this register. In this implementation the maximum FIFO depth is 16 
bytes for both transmit and receive FIFOs. 
Table 15.8 lists the interrupts indicated by the intrId field along with their relative priority, source and reset
control.


Register 
R_UartIntrIdFifoCtrl 


Attributes 


-kernel -writeonemixed 


Address 
0xE_B800_0010 


Definitions 


31:8 Reserved.

7:6 R 0x3 Reserved.

5:4 R 0x0 Reserved.

3:1 intrId R 0x0 Interrupt Id. (See Table 15.8 below)


0 intrPend R Interrupt Pending (active low)
’0’ - Interrupt pending.
’1’ - Interrupt not pending.
Overlaps Allowed.


7:6 rxFifoTrigLvl W Receive FIFO Trigger Level. Define the Receive FIFO
Interrupt trigger level.
’0x0’ - 1 byte
’0x1’ - 4 bytes
’0x2’ - 8 bytes
’0x3’ - 14 bytes
Overlaps Allowed.


5:3 W Reserved.
Overlaps Allowed.


2 txReset W1C Transmit FIFO Reset. Writing a ’1’ to this bit clears the
Transmitter FIFO and resets its logic. The shift register
is not cleared, i.e. transmitting of the current character
continues.
Overlaps Allowed.

1 rxReset W1C Receive FIFO Reset. Writing a ’1’ to this bit clears the
Receiver FIFO and resets its logic. It does not clear the
shift register, i.e. receiving of the current character continues.
Overlaps Allowed.


0 W Reserved.
Overlaps Allowed.
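
The two views of this register can be illustrated with a short sketch: one helper composes a value to write to the FCR and another decodes a value read from the IIR. The helper names are hypothetical, and placing rxFifoTrigLvl in bits 7:6 is an assumption based on the standard 16550 layout.

```python
# Trigger level in bytes mapped to the rxFifoTrigLvl encoding.
TRIGGER_CODES = {1: 0x0, 4: 0x1, 8: 0x2, 14: 0x3}


def fcr_value(trigger_bytes, tx_reset=False, rx_reset=False):
    """Compose an FCR write: trigger level (bits 7:6, assumed) and resets."""
    code = TRIGGER_CODES[trigger_bytes]
    return (code << 6) | (int(tx_reset) << 2) | (int(rx_reset) << 1)


def iir_decode(raw):
    """Decode an IIR read: intrPend is active low, intrId is bits 3:1."""
    return {"pending": not raw & 0x01, "intr_id": (raw >> 1) & 0x07}
```

For example, selecting a 14-byte trigger level while clearing both FIFOs gives 0xC6, and an IIR read of 0xC1 decodes as no interrupt pending.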
Table 15.8: Interrupt ID Field Definitions


Interrupt Type | Interrupt Source | Interrupt Reset Control
Receiver Line Status | Parity, Overrun or Framing errors or Break Interrupt. | Reading the Line Status Register.
Receiver Data Available | FIFO trigger level reached. | FIFO drops below trigger level.
Timeout Indication | There’s at least 1 character in the FIFO but no character has been input to the FIFO or read from it for the last 4 character times. Should not occur under normal operation. | Reading from the FIFO Receiver Data Register.
Transmitter Holding Register Empty | Transmitter Data Register is empty. | Writing to the Transmitter Data Register or reading the IIR.
Modem Status | CTS, DSR, RI or DCD. Only CTS should trigger this interrupt under normal operation. | Reading the Modem Status Register.


15.5.5 Line Control Register (LCR)


Description


The LCR allows the specification of the format of the asynchronous data communication used. A bit in the
register also allows access to the Divisor Latches, which define the baud rate. Reading from the register is allowed
to check the current settings of the communication.


Register 
R_UartLineCtrl 


Attributes 


-kernel 


Address 
0xE_B800_0018


Definitions 


7 Divisor Latch Access Bit.
’0’ - The normal registers are accessed.
’1’ - The divisor latches can be accessed.


6 breakCtrl RW Break Control Bit.
’0’ - Break is disabled.
’1’ - The serial output is forced into logic ’0’ (break state).
Always leave at the reset value.
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stickParity Stick Parity Control Bit. 
’0’ - Stick Parity disabled. 
*Y’ - If bits 3 and 4 are logic ’1’, the parity bit is trans- 
mitted and checked as logic ’0’. If bit 3 is 71’ and bit 4 is 
0’ then the parity bit is transmitted and checked as ’1’. 
Always leave at the reset value. 


4 evenParity Even Parity Select.
’0’ - Odd number of ’1’s are transmitted and checked in each word (data and parity combined). In other words, if
the data has an even number of ’1’s in it, then the parity bit is ’1’.
’1’ - Even number of ’1’s are transmitted in each word.
Always leave at the reset value.

3 parityEnb Parity Enable.
’0’ - No parity.
’1’ - Parity bit is generated on each outgoing character and is checked on each incoming one.
Always leave at the reset value.

2 stopBits Stop bits. Specify the number of generated stop bits.
’0’ - 1 stop bit.
’1’ - 1.5 stop bits when 5-bit character length selected and 2 bits otherwise.
Note: The receiver always checks the first stop bit only.
Always leave at the reset value.

1:0 bitsPerChar Bits per character. Select number of bits in each character.
’0x0’ - 5 bits
’0x1’ - 6 bits
’0x2’ - 7 bits
’0x3’ - 8 bits
Always leave at the reset value.
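
The field layout above can be sketched as a small packing helper. This is only an illustration: the macro and function names are hypothetical, bits 1:0, 2, 3, 4 and 7 are taken from the table and the initialization text, and bits 5 and 6 are assumed from the order of the table rows. The table also says to leave most fields at their reset values, so a real driver would likely only ever toggle bit 7.

```c
#include <stdint.h>

/* Bit positions follow the LCR field table. Bits 5 and 6 are an
 * assumption based on the row order; the others are stated in the text. */
#define LCR_STOP_BITS     (1u << 2)
#define LCR_PARITY_ENB    (1u << 3)
#define LCR_EVEN_PARITY   (1u << 4)
#define LCR_STICK_PARITY  (1u << 5)
#define LCR_BREAK_CTRL    (1u << 6)
#define LCR_DIV_ACCESS    (1u << 7)

/* Hypothetical helper: pack a line format into an LCR value.
 * char_bits must be 5..8 and maps to bitsPerChar 0x0..0x3. */
static uint8_t lcr_compose(unsigned char_bits, int extra_stop,
                           int parity_enb, int even_parity, int div_access)
{
    uint8_t lcr = (uint8_t)((char_bits - 5u) & 0x3u);
    if (extra_stop)  lcr |= LCR_STOP_BITS;
    if (parity_enb)  lcr |= LCR_PARITY_ENB;
    if (even_parity) lcr |= LCR_EVEN_PARITY;
    if (div_access)  lcr |= LCR_DIV_ACCESS;
    return lcr;
}

/* Hypothetical helper: recover the character length from an LCR value. */
static unsigned lcr_char_bits(uint8_t lcr)
{
    return 5u + (lcr & 0x3u);
}
```

For example, an 8-bit, no-parity, 1-stop-bit format packs to 0x03, and the same format with the Divisor Latch Access Bit set packs to 0x83.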


15.5.6 Modem Control Register (MCR) 


Description 


The MCR allows transferring control signals to a modem connected to the UART. 


Register 
R_UartModemCtrl 


Attributes 


-kernel 


Address 
0xE_B800_0020 


Definitions 


Reserved




loopback Loopback Mode.
’0’ - Normal operation.
’1’ - Loopback mode.
When in loopback mode, the Serial Output Signal (STX_PAD_O) is set to logic ’1’. The signal of the transmitter
shift register is internally connected to the input of the receiver shift register.
The following connections are made:
DTR -> DSR
RTS -> CTS
Out1 -> RI
Out2 -> DCD
Always leave at the reset value.

Out2. In loopback mode, connected to Data Carrier Detect (DCD) input.
Always leave at the reset value.

Out1. In loopback mode, connected to Ring Indicator (RI) signal input.
Always leave at the reset value.

Request To Send (RTS) Signal Control.
’0’ - RTS is ’1’
’1’ - RTS is ’0’

Data Terminal Ready (DTR) Signal Control.
’0’ - DTR is ’1’
’1’ - DTR is ’0’


Unused in this implementation. 


15.5.7 Line Status Register (LSR) 
Description 


The LSR provides the operational line status for the UART. The line status consists of transmitter line and 
FIFO status and receiver FIFO error, break and ready indicators. 


Register 
R_UartLineStatus 


Attributes 


-kernel 


Address 
0xE_B800_0028


Definitions 


Receive FIFO Error.
’1’ - At least one parity error, framing error, overrun error or break indication has been received and is inside
the FIFO. The bit is cleared upon reading from the register.
’0’ - Otherwise.

Transmitter Empty.
’1’ - Both the transmitter FIFO and transmitter shift register are empty. The bit is cleared when data is being
written to the transmitter FIFO.
’0’ - Otherwise.

Transmit FIFO Empty.
’1’ - The transmitter FIFO is empty. Generates Transmitter Holding Register Empty interrupt. The bit is cleared
when data is being written to the transmitter FIFO.
’0’ - Otherwise.

Break Interrupt (BI) Indicator.
’1’ - A break condition has been reached in the current character. The break occurs when the line is held in logic
0 for a time of one character (start bit + data + parity + stop bit). In that case, one zero character enters the
FIFO and the UART waits for a valid start bit to receive next character. The bit is cleared upon reading from the
register. Generates Receiver Line Status interrupt.
’0’ - No break condition in the current character.

Framing Error (FE) Indicator.
’1’ - The received character at the top of the FIFO did not have a valid stop bit. Of course, generally, it might
be that all the following data is corrupt. The bit is cleared upon reading from the register. Generates Receiver Line
Status interrupt.
’0’ - No framing error in the current character.

Parity Error (PE) Indicator.
’1’ - The character that is currently at the top of the FIFO has been received with parity error. The bit is cleared
upon reading from the register. Generates Receiver Line Status interrupt.
’0’ - No parity error in the current character.

Overrun Error (OE) Indicator.
’1’ - If the Receive FIFO is full and another character has been received in the receiver shift register. If another
character is starting to arrive, it will overwrite the data in the shift register but the FIFO will remain intact. The
bit is cleared upon reading from the register. Generates Receiver Line Status interrupt.
’0’ - No overrun state.

Data Ready (DR) Indicator. 

’1’ - At least one character has been received and is in the 
Receive FIFO. 

’0’ - No characters in the Receive FIFO. 
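
Because several LSR bits are cleared by the read itself, software normally reads the register once and tests the saved copy. A minimal sketch follows. The bit positions are an assumption: the bit-number column did not survive in this table, so the sketch uses the conventional 16550 order (bit 7 down to bit 0), which matches the order in which the fields are listed. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions (conventional 16550 order, matching the
 * top-to-bottom order of the fields in the table). */
#define LSR_DATA_READY     (1u << 0)
#define LSR_OVERRUN_ERR    (1u << 1)
#define LSR_PARITY_ERR     (1u << 2)
#define LSR_FRAMING_ERR    (1u << 3)
#define LSR_BREAK_INTR     (1u << 4)
#define LSR_TX_FIFO_EMPTY  (1u << 5)
#define LSR_TX_EMPTY       (1u << 6)
#define LSR_RX_FIFO_ERR    (1u << 7)

/* Each helper works on a saved copy of the LSR so that one hardware
 * read can answer several questions. */
static bool lsr_rx_has_data(uint8_t lsr)
{
    return (lsr & LSR_DATA_READY) != 0;
}

static bool lsr_tx_fifo_empty(uint8_t lsr)
{
    return (lsr & LSR_TX_FIFO_EMPTY) != 0;
}

static bool lsr_line_error(uint8_t lsr)
{
    return (lsr & (LSR_OVERRUN_ERR | LSR_PARITY_ERR |
                   LSR_FRAMING_ERR | LSR_BREAK_INTR)) != 0;
}
```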




15.5.8 Modem Status Register (MSR) 


Description 


The MSR displays the current state of the modem control lines. 


Register 


R_UartModemStatus 
Attributes


-kernel 


Address 
0xE_B800_0030 


Definitions 


Reserved

cdcd DCD Complement Input. Always ’1’.
Or equals Out2 in loopback mode.

RI Complement Input. Always ’1’.
Or equals Out1 in loopback mode.

cdsr DSR Complement Input. Always ’1’.
Or equals DTR in loopback mode.

ccts CTS Complement Input.
Or equals RTS in loopback mode.

Delta Data Carrier Detect. Always ’0’.
’1’ - The DCD line has changed its state.
’0’ - Otherwise.

Trailing Edge of Ring Indicator. Always ’0’.
’1’ - The ring indicator has changed state from low to high.
’0’ - Otherwise.

Delta Data Set Ready. Always ’0’.
’1’ - If the DSR line has changed its state.
’0’ - Otherwise.

Delta Clear To Send.
’1’ - The CTS line has changed its state.
’0’ - Otherwise.


15.5.9 UART Enable Register 


Description 


The UART Enable Register allows software to observe the UART I/O Enable condition. This register is not 
part of the UART core but is a read-only I/O space register implemented in the Wishbone Interface (WBI) widget 
of the PMI. It is documented here because of its close affinity with UART operation. 


Register 
R_UartEnable 


Attributes 


-kernel 


Address 
0xE_B800_0040 


Definitions 
Reserved

UART IO Enabled
’0’ - If the UART I/O is not enabled at the chip pins.
’1’ - If the UART I/O is enabled at the chip pins.
Settable only via the SysChain from the Module Service Processor.


15.5.10 UART Reset Register 


Description 


The UART Reset Register allows software to reset the UART core. This register is not part of the UART core 
but is a write-only I/O space register implemented in the Wishbone Interface (WBI) widget of the PMI. A write 
of any value to this register will perform a reset of the UART. It is documented here because of its close affinity 
with UART operation. 


Register 
R_UartReset 


Address 
0xE_B800_0048


Definitions 


31:0 reset Ws UART Reset. 
A write of any value resets the UART. 


15.6 Reset 


The UART Core can be reset under both hardware and software control. The hardware reset is provided at 
power-on and under Module Service Processor control via the UART Reset Bit in the SysChain’s Reset Control 
Register. The software reset is provided by the R_UartReset register. Writing any value to this register will reset 
the UART. Upon either reset, all UART registers revert to their reset default values and it is up to software to 
write them with useful values afterwards. 
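
The software reset behaviour can be sketched with a tiny software model. This is not driver code: `uart_model_t` and `uart_model_write` are hypothetical names, and the model tracks only the three pieces of state that the Initialization section lists as cleared (Divisor Latch, Line Control Register, interrupt enables).

```c
#include <stdint.h>

#define R_UART_RESET_ADDR 0xEB8000048ULL  /* R_UartReset, 0xE_B800_0048 */

/* Hypothetical model of the UART state that a reset returns to zero. */
typedef struct {
    uint16_t divisor;  /* Divisor Latch */
    uint8_t  lcr;      /* Line Control Register */
    uint8_t  ier;      /* Interrupt Enable Register */
} uart_model_t;

/* A write of any value to R_UartReset resets the model; writes to
 * other addresses are ignored by this sketch. */
static void uart_model_write(uart_model_t *u, uint64_t addr, uint64_t value)
{
    (void)value;  /* the written value does not matter */
    if (addr == R_UART_RESET_ADDR) {
        u->divisor = 0;
        u->lcr = 0;
        u->ier = 0;
    }
}
```

After such a write, software has to run the full initialization sequence again before the UART is usable.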


15.7 Initialization 


In the ICE9 implementation, the UART core exits reset synchronous to CCLK. During reset the core performs 
the following tasks: 


• The receiver and transmitter FIFOs are cleared.
• The receiver and transmitter shift registers are cleared.
• The Divisor Latch register is set to 0.
• The Line Control Register is set to 0.
• All interrupts are disabled in the Interrupt Enable Register.

After reset, perform the following initializations in the order listed for normal UART operation:


1. Set the Line Control Register to the desired line control parameters. Set bit 7 to ’1’ to allow access to the 
Divisor Latches. 
2. Set the Divisor Latches, MSB first, LSB last.


3. Set bit 7 of LCR to ’0’ to disable access to the Divisor Latches. At this time the transmission engine starts 
working and data can be sent and received. 


4. Set the FIFO trigger level. Generally, higher trigger level values produce fewer interrupts, so setting it to 14 
bytes is recommended if the system responds fast enough. 


5. Enable desired interrupts by setting the appropriate bits in the Interrupt Enable Register. 
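
The five steps above can be sketched as a write sequence. To keep the example self-contained, writes go to an in-memory log instead of hardware. The register identifiers, the log type and the function names are all hypothetical, and the FIFO control and interrupt enable values are passed in by the caller because their encodings are defined elsewhere in this chapter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register identifiers; real code would use the
 * memory-mapped register addresses instead. */
typedef enum { REG_DIV_LSB, REG_DIV_MSB, REG_IER, REG_FCR, REG_LCR } uart_reg_t;

typedef struct { uart_reg_t reg; uint8_t value; } uart_write_t;
typedef struct { uart_write_t log[8]; size_t n; } uart_log_t;

/* Record one register write (stands in for an MMIO store). */
static void uart_write(uart_log_t *l, uart_reg_t reg, uint8_t value)
{
    if (l->n < sizeof l->log / sizeof l->log[0]) {
        l->log[l->n].reg = reg;
        l->log[l->n].value = value;
        l->n++;
    }
}

static void uart_init(uart_log_t *l, uint8_t line_format, uint16_t divisor,
                      uint8_t fifo_ctrl, uint8_t intr_enb)
{
    uart_write(l, REG_LCR, (uint8_t)(line_format | 0x80u));  /* 1: format, bit 7 = 1 */
    uart_write(l, REG_DIV_MSB, (uint8_t)(divisor >> 8));     /* 2: MSB first */
    uart_write(l, REG_DIV_LSB, (uint8_t)(divisor & 0xFFu));  /*    LSB last */
    uart_write(l, REG_LCR, (uint8_t)(line_format & 0x7Fu));  /* 3: bit 7 = 0 */
    uart_write(l, REG_FCR, fifo_ctrl);                       /* 4: FIFO trigger level */
    uart_write(l, REG_IER, intr_enb);                        /* 5: enable interrupts */
}
```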


15.8 Interrupts 


The UART core can send an interrupt to the processors via the ICE9 interrupt logic. See the Processor Segments 
chapter in this specification for a complete description of how the processors handle this interrupt. 


To generate a UART interrupt on reception of data, first set the encoding for the Receive FIFO Trigger Level
(rxFifoTrigLvl) in the FIFO Control Register (R_UartIntrIdFifoCtrl) to the number of bytes (1, 4, 8, or 14) to be
buffered in the receive FIFO before an interrupt is sent, then set the Enable Receiver Data Available Interrupt (rdi)
bit in the Interrupt Enable Register (R_UartIntrEnb). To generate a UART interrupt when sending data, set the
Enable Transmitter Holding Register Empty Interrupt (thre) bit. To enable interrupts whenever TxCTS_L changes,
set the Enable Modem Status Interrupt (ms) bit.


When handling a UART interrupt, the interrupt handler should examine the Interrupt Id (intrId) bits in the 
Interrupt Identification Register (R_UartIntrIdFifoCtrl) to determine the cause of the interrupt. See Table 15.8 for 
a complete description of the Interrupt Id bits. 
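
A sketch of the two values software has to compose is shown below. The encodings are assumptions, since the field tables for these registers are outside this section: the trigger-level codes and the interrupt enable bit positions follow the conventional 16550 layout. The function and macro names are hypothetical.

```c
#include <stdint.h>

/* Assumed Interrupt Enable Register bit positions (conventional 16550). */
#define IER_RDI   (1u << 0)  /* Enable Receiver Data Available Interrupt */
#define IER_THRE  (1u << 1)  /* Enable Transmitter Holding Register Empty Interrupt */
#define IER_MS    (1u << 3)  /* Enable Modem Status Interrupt */

/* Map a trigger level in bytes to an assumed 2-bit rxFifoTrigLvl code.
 * Returns -1 for a level the FIFO does not support. */
static int rx_fifo_trig_code(unsigned bytes)
{
    switch (bytes) {
    case 1:  return 0;
    case 4:  return 1;
    case 8:  return 2;
    case 14: return 3;
    default: return -1;
    }
}

/* Build an interrupt enable value from the three enables named above. */
static uint8_t ier_compose(int rdi, int thre, int ms)
{
    uint8_t ier = 0;
    if (rdi)  ier |= IER_RDI;
    if (thre) ier |= IER_THRE;
    if (ms)   ier |= IER_MS;
    return ier;
}
```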


15.9 External Connections 


15.9.1 Module Service Processor Enabled I/O 


In the ICE9 implementation, the UART TX and RX data lines and hardware flow control signals are brought to 
pins off-chip. All off-chip UART signals are enabled by the Module Service Processor (MSP) via a bit in a shadow 
latch on the SysChain in the LBS unit. Figure 15.1 below is a schematic that shows how the UART I/O pad on 
the chip is configured. 


If the UART is left disabled, the UART RX line and the TxCTS_L output flow control are driven internally by
the chip to a logic ’1’, so the UART core sees only STOP bits with output flow control off and ignores anything
that the MSP may be writing to the line. In addition, the UART TX line and RxRTS_L input flow control are also
disabled, allowing another ICE9 chip to drive the line. This is accomplished by using an open-drain driver with a
hardwired input of logic ’0’ and an external pull-up on the Tx and RxRTS_L output pins. Whenever the SysChain
UART Enable is asserted, the UART core’s outputs control the enables, allowing the driver to toggle between logic
levels. Otherwise the driver is left disabled and external weak pull-up resistors (Rext) are used to hold the lines at
the logic ’1’ state, effectively driving STOP bits to the MSP with input flow control off unless another ICE9 chip
is driving the wire.


This greatly simplifies the UART interconnect between the ICE9 chips and allows the MSP to control which 
ICE9 UART port is active. 
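
The shared-wire behaviour can be sketched as a small model. It is an illustration only: `ice9_uart_t` and `shared_tx_line` are hypothetical names. Each chip pulls the wire low only when its SysChain UART Enable is set and its UART core is sending a ’0’; otherwise the external pull-up holds the wire at ’1’.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-chip state for the model. */
typedef struct {
    bool uart_enable;  /* SysChain UART Enable for this chip */
    bool tx_level;     /* level the UART core wants to send */
} ice9_uart_t;

/* Resolve the level of a Tx wire shared by n open-drain drivers. */
static bool shared_tx_line(const ice9_uart_t *chips, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (chips[i].uart_enable && !chips[i].tx_level) {
            return false;  /* an enabled driver pulls the wire low */
        }
    }
    return true;  /* nobody is driving: the pull-up (Rext) wins */
}
```

A disabled chip never affects the wire in this model, whatever its core outputs, which is why the MSP can select the active port by asserting a single enable.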
Figure 15.1: UART External Connections

(a) UART I/O Interface

[Schematic. Signals shown: UART Rx In Pin (pad_uart_rxd, pad_uart_rxdi); UART TxCTS_L In Pin (pad_uart_txcts_l,
pad_uart_txctsi_l); UART RxRTS_L Out Pin (uart_pad_rxrts_l = ’0’, uart_pad_rxrtso_l, uart_pad_rxrtsOe, Rext);
UART Tx Out Pin (uart_pad_txd = ’0’, uart_pad_txdo, uart_pad_txdOe, Rext); SysChain UART Enable; Chip Boundary.]


15.9.2 RS232 Line Voltage Conversion 


Because the ICE9 supports I/O voltages of only 0 and +2.5 Volts on the UART pins, an external RS-232 line 
converter chip should be used to match voltage and logic levels to the RS-232 standard if that is desired. 
Addressing


[Last modified: $Id: chipaddr.lyx 43441 2007-08-17 17:38:27Z wsnyder $] 


16.1 Overview 


This chapter discusses the global address map. The ICE-9 physical address is 36 bits, split into half cached and 
half uncached IO space. This allows a maximum of 32GB of main memory. 


16.2 Differences, Bugs, and Enhancements 


16.2.1 Product and Chip Pass Differences 


1. TWC9A adds some values to the AddrBusStop enumeration to support the additional cores, bug3377 . 


16.3 Physical Address Regions


The 36-bit CPU physical address is split into the following major regions. 


Start Address | End Address | Size | Access | Description

0x0_0000_0000 | 0x7_FFFF_FFFF | 32GB | Any | Main memory - Cachable. There are some magic regions in this space,
including use of the last 4GB for boot; see the Definitions.

0x8_0000_0000 | 0xB_FFFF_FFFF | 16GB | | PCI-Express memory-mapped IO. The PCI address is {28’b0, 1’b0,
cpu_addr[33:0]}. Note 32 bit PCI devices are visible in only the first 4GB of this region; only 64 bit devices are
visible in the final 12GB.

0xC_0000_0000 | 0xC_EFFF_FFFF | | | PCI-Express port-mapped IO. PCI port I/O address = cpu_addr[31:0].

0xC_F000_0000 | 0xC_FFFF_FFFF | | 32-bit | PCI-Express configuration space IO. PCI config address =
{cpu_addr[27:16], 4’b0, [11:0]}.

0xD_0000_0000 | 0xD_FFFF_FFFF | | |

0xE_0000_0000 | 0xE_7FFF_FFFF | 2GB | 32-bit | Internal SCB bus registers. This space is further divided into 128
subsections based on the encoding described in AddrSubId. See 16.6.6.

0xE_8000_0000 | 0xE_FFFF_FFFF | 2GB | 64-bit | Internal Non-SCB registers. This space is further divided into 128
subsections based on the encoding described in AddrSubId.

Figure 16.1: Physical CPU to/from PCI addresses
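
The region boundaries can be sketched as a classifier, together with the CPU-to-PCI memory address translation. The type and function names are hypothetical; the boundaries are the ones given in the region table and in the register region definitions of section 16.6, and the translation simply keeps cpu_addr[33:0].

```c
#include <stdint.h>

typedef enum {
    REGION_MAIN_MEMORY,
    REGION_PCI_MEM,
    REGION_PCI_IO,
    REGION_PCI_CONFIG,
    REGION_SCB,
    REGION_NON_SCB,
    REGION_UNLISTED   /* not described by the region table */
} addr_region_t;

/* Classify a 36-bit CPU physical address. */
static addr_region_t classify_phys_addr(uint64_t pa)
{
    if (pa <= 0x7FFFFFFFFULL) return REGION_MAIN_MEMORY;
    if (pa <= 0xBFFFFFFFFULL) return REGION_PCI_MEM;
    if (pa <= 0xCEFFFFFFFULL) return REGION_PCI_IO;
    if (pa <= 0xCFFFFFFFFULL) return REGION_PCI_CONFIG;
    if (pa >= 0xE00000000ULL && pa <= 0xE7FFFFFFFULL) return REGION_SCB;
    if (pa >= 0xE80000000ULL && pa <= 0xEFFFFFFFFULL) return REGION_NON_SCB;
    return REGION_UNLISTED;
}

/* PCI memory address for a CPU address in the PCI memory region:
 * the upper bits are zero and cpu_addr[33:0] passes through. */
static uint64_t cpu_to_pci_mem(uint64_t pa)
{
    return pa & ((1ULL << 34) - 1);
}
```

For instance, the UART Line Control Register at 0xE_B800_0018 falls in the non-SCB region, and CPU address 0x9_0000_0000 corresponds to PCI memory address 0x1_0000_0000 (4GB).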


16.4 PCI Address Regions 


PCI has three distinct address spaces. PCI Config space and PCI port-mapped IO space are special spaces used 
for CPU generated transactions, and have no special address decodings. The 64-bit PCI Memory Space is divided 
into the following regions: 


Start Address | End Address | Size | Access | Description

0x0_0000_0000 | 0x7_FFFF_FFFF | 32GB | Any | Maps back to cachable memory, or PCI memory I/O registers, based on a
subtractive decode in the PMI. Note only the low 4GB is visible to 32-bit PCI devices, and thus this space may have
“holes” to insert the 32-bit devices.

0x8_0000_0000 | 0xF_FFFF_FFFF | 32GB | Any | Maps back to cachable memory. The PMI zeros PCI address bit 35 to
generate the memory address. As this region maps all memory without I/O device holes, it should be the DMA region
used for all 64 bit PCI devices.


0x10_0000_0000 | | | | Reserved
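
The upper PCI window differs from physical memory only in PCI address bit 35, so the mapping in both directions is a single bit operation. A minimal sketch, with hypothetical function names:

```c
#include <stdint.h>

#define PCI_DMA_WINDOW_BIT (1ULL << 35)  /* PCI address bit 35 */

/* PCI address a 64-bit device should use to reach a physical memory address. */
static uint64_t phys_to_pci_dma64(uint64_t phys)
{
    return phys | PCI_DMA_WINDOW_BIT;
}

/* Memory address the PMI generates: PCI address with bit 35 zeroed. */
static uint64_t pci_dma64_to_phys(uint64_t pci_addr)
{
    return pci_addr & ~PCI_DMA_WINDOW_BIT;
}
```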


16.4.1 Software allocation of PCI address space 


When allocating addresses for memory-mapped devices on the PCI bus, software needs to exercise caution in the
allocation of the addresses. While prefetchable memory must support 64 bit addressing, non-prefetchable may only
support 32 bit addressing, which limits devices to the low 4GB of the address space. This suggests the following
policy:

1. Allocate BARs starting with the largest request and working down. This avoids holes, as the PCI spec suggests.


2. Allocate 64-bit capable memory BARs anywhere between PCI addresses 4GB and 16GB (Physical addresses
9_0000_0000 and B_FFFF_FFFF).
3. Allocate 32-bit only capable memory BARs working down from FFFF_FFFF to 0. Working from the top down
increases the likelihood that 32 bit DMA devices will be able to see all of memory.

4. 32-bit DMA devices, if there are any, may see main memory in a window between PCI addresses 0 and the
beginning of the first 32-bit BAR allocated in step 3. The rest of memory is inaccessible, and memory copies
will be required for DMA to memory outside this window. (High performance devices should be 64-bit, so
this shouldn’t matter for performance.)

5. 64-bit DMA devices access main memory with PCI addresses 8_0000_0000 to F_FFFF_FFFF, which map
down to physical addresses 0 to 7_FFFF_FFFF. All of memory is visible in this window.
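
Steps 1 to 3 of this policy can be sketched as a simple allocator. This is one possible reading of the policy, not the actual system software: the type and function names are hypothetical, BAR sizes are assumed to be non-zero powers of two with natural alignment (as PCI requires), and requests that do not fit are just marked as not placed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PCI_4GB   (4ULL << 30)
#define PCI_16GB  (16ULL << 30)

typedef struct {
    uint64_t size;      /* non-zero power of two */
    bool     is_64bit;  /* true if the BAR supports 64 bit addressing */
    uint64_t pci_addr;  /* result: assigned PCI address */
    bool     placed;    /* result: false if the request did not fit */
} bar_request_t;

static int by_size_descending(const void *a, const void *b)
{
    uint64_t sa = ((const bar_request_t *)a)->size;
    uint64_t sb = ((const bar_request_t *)b)->size;
    return (sa < sb) - (sa > sb);
}

static void allocate_bars(bar_request_t *bars, size_t n)
{
    uint64_t next64 = PCI_4GB;  /* 64-bit BARs grow up from 4GB toward 16GB */
    uint64_t top32  = PCI_4GB;  /* 32-bit BARs grow down from FFFF_FFFF */
    size_t i;

    qsort(bars, n, sizeof bars[0], by_size_descending);  /* step 1 */
    for (i = 0; i < n; i++) {
        bar_request_t *b = &bars[i];
        if (b->is_64bit) {                                /* step 2 */
            uint64_t base = (next64 + b->size - 1) & ~(b->size - 1);
            b->placed = (base + b->size <= PCI_16GB);
            if (b->placed) {
                b->pci_addr = base;
                next64 = base + b->size;
            }
        } else {                                          /* step 3 */
            b->placed = (b->size <= top32);
            if (b->placed) {
                top32 = (top32 - b->size) & ~(b->size - 1);
                b->pci_addr = top32;
            }
        }
    }
}
```

After the call the array is sorted by size, largest first. The lowest 32-bit BAR address (`top32` at the end of the loop) is the upper limit of the memory window that 32-bit DMA devices can see, as described in step 4.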


16.5 General Behavior 


16.5.1 Access size 

Software must use the appropriately sized transaction to access registers; using the wrong size results in
unpredictable behavior. See 16.3 on page 843 for which areas are 32-bit or 64-bit only.

16.5.2 Read side effects


Unless explicitly specified in a register definition with an “S” in the type field, reads do not have side effects.


16.5.3 Illegal Addresses 


Access to addresses that are not implemented (either unspecified or mapping to non-existent memory) will cause
unspecified behavior. On writes, this may include a No-Op, aliasing to other addresses, or creation of machine
checks. On reads, this may include returning random data, aliasing to a register with read side effects, or creation
of machine checks. However, all illegal address accesses will complete; they shall not hang.


16.6 Registers and Definitions 


16.6.1 Package Attributes 
Package 
chip_addr_spec 


16.6.2 Definitions 
Defines 
ADDR 


Constant | Mnemonic | Definition

32’d40 | PABITS | Physical Address Bits. Number of physical address bits implemented.

32’d39 | IOBIT | Memory/IO Bit. Address bit that selects memory versus non-cachable IO space.

36’h0_1fc0_0000 | BOOT | Processor Boot Address. First processor fetch is from this address.

36’h7_2000_0000 | BOOT1_PA | Scratch space for boot1 phase.

64’ha000_0007_2000_0000 | | Scratch space for boot1 phase.

 | | Scratch space for boot2 phase.

64’ha000_0007_2001_0000 | | Scratch space for boot2 phase.

 | EJTAG_FASTDATA_VA | EJtag Fastdata register.

64’hffff_ffff_ff20_0200 | EJTAG_BOOT_VA | EJtag Boot address.


16.6.3 Manufacturer Enumeration


AddrTapMfer specifies the JTAG manufacturer number in the R_SysTapIDecode and R_CpuTapIDCODE registers.

Enum


AddrTapMfer


11’h2c2 SICORTEX | EJTAG Manufacturer ID for SiCortex. (ID 66, bank 6.)


16.6.4 Product Enumeration 


AddrProduct specifies the product name for the R_ScbChipRev (see 10.14.6) and R_CpuPRID registers. It is
also used for the JTAG part number in R_SysTapIDecode and R_CpuTapIDCODE register.


Enum 
AddrProduct 
Attributes 


-kernel 


8’d19 ICE9 Ice9a for SCX-1000 series. Used in R_CpuPRId, 
R_ScbChipRev and R_SysTapIDecode registers. 


8’d20 ICE9_CPU0 Ice9 EJTAG for CPUO. Used in R_CpuTapIDECODE 
EJTAG UDR only. This differs from ICE9 above so that 
we may differentiate each EJTAG TAP from the SysChain 
TAP. 


8’d21 ICE9_CPU1 Ice9 EJTAG for CPU1.
8’d22 ICE9_CPU2 Ice9 EJTAG for CPU2.
8’d23 ICE9_CPU3 Ice9 EJTAG for CPU3.
ICE9_CPU5
Ice9b for SCX-1000 series. Used in R_CpuPRId,
R_ScbChipRev and R_SysTapIDecode registers.

8’d27 ICE9B_CPU Ice9b EJTAG part number for CPUs. Used in
R_CpuTapIDECODE EJTAG UDR only. This differs
from ICE9 above so that we may differentiate each EJTAG
TAP from the SysChain TAP. In ICE9B, each processor’s
UDR is differentiated with the revision number,
rather than a different AddrProduct encoding.


8’d30 TWC9A twc9a Twice9A. Used in R_CpuPRId, R_ScbChipRev and
R_SysTapIDecode registers.


8’d31 TWC9A_CPU | twc9a+ | Twice9A EJTAG part number for CPUs.


16.6.5 Address Bus Stop Numbers 


This enumeration contains the software bus stop number, used by the address assignments below, and interrupts. 
Physical stop numbers may differ without affecting software, see 7.17.10. 


Enum 
AddrBusStop 
Attributes 


-kernel 
Value | Name | Product | Description
4’h0 | COHO | | Coherence controller on odd side
4’h1 | DMA | | DMA controller
4’h2 | PS0 | | L2 segment for processor 0
4’h3 | PS1 | | L2 segment for processor 1
4’h4 | PS2 | | L2 segment for processor 2
4’h5 | PS3 | | L2 segment for processor 3
4’h6 | PS4 | | L2 segment for processor 4
4’h7 | PS5 | | L2 segment for processor 5
4’h8 | | |
4’h9 | COHE | | Coherence controller on even side
 | | TWC9A+ |
 | | | Reserved. (Local loopback and aliasing)
 | | | Reserved. (Broadcast to all nodes, legal from COHE or COHO only)

16.6.6 Sub-chip IDs 


The IO region is split into 128 pieces, one for each major subcomponent on the ICE9. This same encoding
determines the upper address bits (30:24) of the control registers in each subchip, and if using the SCB, the SCB
identifier. Furthermore, address bits (27:24) or enum bits (3:0) must match the AddrBusStop of that component.
For example an AddrSubId of 7’h03 corresponds to SCB address 0xE03xx_xxxx.
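
The relationship between an AddrSubId, its register base address and its bus stop can be sketched as follows. The function names are hypothetical; the bit positions (30:24 for the sub-chip ID, 27:24 for the bus stop) and the two region bases come from the text and the physical address region table.

```c
#include <stdint.h>

#define SCB_REGION_BASE      0xE00000000ULL  /* 0xE_0000_0000 */
#define NON_SCB_REGION_BASE  0xE80000000ULL  /* 0xE_8000_0000 */

/* Base address of a sub-chip's registers in the SCB region. */
static uint64_t scb_base(unsigned sub_id)
{
    return SCB_REGION_BASE | ((uint64_t)(sub_id & 0x7Fu) << 24);
}

/* Base address of a sub-chip's registers in the non-SCB region. */
static uint64_t non_scb_base(unsigned sub_id)
{
    return NON_SCB_REGION_BASE | ((uint64_t)(sub_id & 0x7Fu) << 24);
}

/* Sub-chip ID encoded in address bits 30:24. */
static unsigned sub_id_of(uint64_t pa)
{
    return (unsigned)((pa >> 24) & 0x7Fu);
}

/* Bus stop number: the low nibble of the sub-chip ID (address bits 27:24). */
static unsigned bus_stop_of(unsigned sub_id)
{
    return sub_id & 0xFu;
}
```

As a cross-check against the UART chapter, the Line Control Register address 0xE_B800_0018 decodes to sub-chip ID 7’h38 in the non-SCB region, which sits on bus stop 8.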

The Clk column below indicates what clock domain that SCB slave operates on, if it has a slave. SCB performance
counters only count cross-products correctly when comparing events in the same clock domain.

The Events column indicates the enumeration listing performance counter event definitions. See the appropriate
sub-chip spec for details.


Enum 


AddrSubId 
(This table is grouped by bus stop, thus it is sorted by the lower nibble, then upper nibble.)


Value | Name | Clk | Events | Product | Definition
7’h00 | COHO | cclk | | | Odd Coherence Controller.
7’h10 | WTIO | n/a | | | Magic address range used internally by CSW WTIO transactions. See CAC_IO_WTIOADDR define.
7’h20 | SIM | n/a | | | Magic address range for simulator control only.
7’h01 | DMA | cclk | DmaScbEvent | | DMA Engine.
7’h41 | OCTBPS6 | cclk | | TWC9A+ | OCLA Collector block for PS6
7’h51 | OCTBPS7 | cclk | | TWC9A+ | OCLA Collector block for PS7
7’h61 | OCTBPS8 | cclk | | TWC9A+ | OCLA Collector block for PS8
7’h71 | OCTBPS9 | cclk | | TWC9A+ | OCLA Collector block for PS9
7’h02 | CPU0 | | CpuScbEvent | | Processor 0. Note all CPU encodings must be sequentially encoded.
7’h12 | CAC0 | n/a | | | L2 Cache 0. (Model directRead/directWrite access only; use CACLOC for registers.)
7’h22 | CPU6 | | CpuScbEvent | TWC9A+ | Processor 6.
7’h32 | CAC6 | n/a | | TWC9A+ | L2 Cache 6.
7’h03 | CPU1 | | CpuScbEvent | | Processor 1.
7’h13 | CAC1 | n/a | | | L2 Cache 1.
7’h23 | CPU7 | | CpuScbEvent | TWC9A+ | Processor 7.
7’h33 | CAC7 | n/a | | TWC9A+ | L2 Cache 7.
Value | Name | Clk | Events | Product | Definition
7’h04 | CPU2 | | | | Processor 2.
7’h14 | CAC2 | n/a | | | L2 Cache 2.
7’h24 | CPU8 | | | | Processor 8.
7’h34 | CAC8 | n/a | | | L2 Cache 8.
7’h05 | CPU3 | pclk | CpuScbEvent | | Processor 3.
7’h15 | CAC3 | | | | L2 Cache 3.
7’h25 | CPU9 | | | | Processor 9.
7’h35 | CAC9 | | | | L2 Cache 9.
7’h06 | CPU4 | pclk | CpuScbEvent | | Processor 4.
7’h16 | CAC4 | | | | L2 Cache 4.
7’h07 | CPU5 | pclk | CpuScbEvent | | Processor 5.
7’h17 | CAC5 | n/a | | | L2 Cache 5.


7’h08 | SCBM | cclk | ScbScbEvent | | Serial Control Bus Master. (SCBM’s own internal registers, not registers of other subchips on the SCB bus).
7’h18 | PCIE | | PmiScbEvent | | PCI-Express PMI internal registers. (Not devices ON the PCI bus.)
7’h38 | UART | n/a | n/a | | UART.
7’h48 | DDR0 | dclk | DdrxEvent | | SDRAM 0.
7’h58 | DDR1 | dclk | DdrxEvent | | SDRAM 1.
7’h68 | OCLA | cclk | n/a | | On chip logic analyzer, common control block.
7’h09 | COHE | cclk | | | Even Coherence Controller
7’h49 | OTRBCPS6 | cclk | | TWC9A+ | OCLA Trigger block for PS6
7’h59 | OTRBCPS7 | cclk | | TWC9A+ | OCLA Trigger block for PS7
7’h69 | OTRBCPS8 | cclk | | TWC9A+ | OCLA Trigger block for PS8
7’h79 | OTRBCPS9 | cclk | | TWC9A+ | OCLA Trigger block for PS9
7’h0A | OCTBCOHE | cclk | | | OCLA Collector block for COHE
7’h1A | OCTBCOHO | cclk | | | OCLA Collector block for COHO
7’h2A | OTRBCCOHE | cclk | | | OCLA Trigger block for COHE
7’h3A | OTRBCCOHO | cclk | | | OCLA Trigger block for COHO
7’h4A | OCTBFSWI | cclk | | | OCLA Collector block for FSW Inputs
7’h5A | OCTBFSWO | cclk | | | OCLA Collector block for FSW Outputs
7’h0B | OCTBPS0 | cclk | | | OCLA Collector block for PS0
7’h1B | OCTBPS1 | cclk | | | OCLA Collector block for PS1
7’h2B | OCTBPS2 | cclk | | | OCLA Collector block for PS2
7’h3B | OCTBPS3 | cclk | | | OCLA Collector block for PS3
7’h4B | OCTBPS4 | cclk | | | OCLA Collector block for PS4
7’h5B | OCTBPS5 | cclk | | | OCLA Collector block for PS5
7’h6B | OCTBDMA | cclk | | | OCLA Collector block for DMA
7’h7B | OCTBPMI | cclk | | | OCLA Collector block for PMI/BBS
7’h0C | OTRBCPS0 | cclk | | | OCLA Trigger block for PS0
7’h1C | OTRBCPS1 | cclk | | | OCLA Trigger block for PS1
7’h2C | OTRBCPS2 | cclk | | | OCLA Trigger block for PS2
7’h3C | OTRBCPS3 | cclk | | | OCLA Trigger block for PS3
7’h4C | OTRBCPS4 | cclk | | | OCLA Trigger block for PS4
7’h5C | OTRBCPS5 | cclk | | | OCLA Trigger block for PS5
7’h6C | OTRBCDMA | cclk | | | OCLA Trigger block for DMA Codeword
7’h7C | OTRBVDMA | cclk | | | OCLA Trigger block for DMA Vector
7’h0D | FLR0 | sclk | FlrScbEvent | | Fabric Link 0 Receive. (via SCB)
7’h1D | FLR1 | sclk | FlrScbEvent | | Fabric Link 1 Receive. (via SCB)
7’h2D | FLR2 | sclk | FlrScbEvent | | Fabric Link 2 Receive. (via SCB)
Value | Name | Clk | Events | Product | Definition
7’h3D | FLT0 | sclk | FltScbEvent | | Fabric Link 0 Transmit. (via SCB)
7’h4D | FLT1 | sclk | FltScbEvent | | Fabric Link 1 Transmit. (via SCB)
7’h5D | FLT2 | sclk | FltScbEvent | | Fabric Link 2 Transmit. (via SCB)
7’h6D | QSC | sclk | n/a | | Fabric Link Quad Controller. (via SCB)
7’h7D | FSW | sclk | FswScbEvent | | Fabric Switch. (via SCB)
7’h1E | CACLOC | n/a | n/a | | L2 Local Cache. Local access to control registers for Processor X by Processor X.
7’h2E | INTR | n/a | n/a | | Interrupt cycle. Local access by each processor.
7’h3E | SPCL | n/a | n/a | | Special cycle. Local access by each processor.
7’h0F | OTRBCPMI | cclk | | | OCLA Trigger block for PMI/CSW Bus Stop
7’h1F | OTRBVFSWO | cclk | | | OCLA Trigger block for FSW Vector Output
7’h2F | OTRBVFSWI | cclk | | | OCLA Trigger block for FSW Vector Input
7’h3F | OTRBCFSW | cclk | | | OCLA Trigger block for FSW Codeword
7’h4F | OTRBCPMH | cclk | | | OCLA Trigger block for PMI/BBS Internals


16.6.7 Main Memory Region 


Register 
R_Mem[0x1_FFFF_FFFF:0] 
Address 
0x0_0000_0000-0x7_FFFF_FFFF 
Attributes 

-noregtest -kernel 


31:0 | Data RW x Main Memory. Transactions to this region will be cached, 
and misses will go to the SDRAM. 


16.6.8 PCI Memory Region 


Register 
R_PciMem[0x0_FFFF_FFFF:0]
Address 
0x8_0000_0000-0xB_FFFF_FFFF 
Attributes 

-noregtest -kernel 


31:0 | Data RW x PCI-Express Memory. Transactions to this region will
initiate PCI-Express bus Memory reads or writes. Note 
32 bit PCI devices are visible in only the first 4GB of this 
region; only 64 bit devices are visible in the final 12GB. 


16.6.9 PCI IO Region


Register
R_PciIo[0x0_3BFF_FFFF:0]
Address 


0xC_0000_0000-0xC_EFFF_FFFF 
Attributes 
-noregtest -kernel 


31:0 | Data RW x PCI-Express IO Space. Transactions to this region will 
initiate PCI-Express bus IO reads or writes. 
16.6.10 PCI Config Region


Register 
R_PciConfig[0x0_03FF_FFFF:0] 
Address 
0xC_F000_0000-0xC_FFFF_FFFF 
Attributes 

-noregtest -kernel 


31:0 | Data RW x PCI-Express Config Space. Transactions to this region 
will initiate PCI-Express bus config reads or writes. 


16.6.11 Internal SCB Region 


Register 
R_IoScb[0x0_1FFF_FFFF:0]
Address 
0xE_0000_0000-0xE_7FFF_FFFF 
Attributes 

-noregtest 


31:0 | Data RW x Internal SCB Registers. Transactions to this region go 
over the SCB bus to the appropriate sub-chip registers. 


16.6.12 Internal Non-SCB Region 


Register 

R_Io[0x1FFF_FFFF:0] 

Address 
0xE_8000_0000-0xE_FFFF_FFFF 
Attributes 

-noregtest 


31:0 | Data RW x Internal Non-SCB Registers. Transactions to this region 
go over the CSW or other busses to the appropriate sub- 
chip registers. 
Pinout


[Last Modified $Id: chippins.lyx 18812 2006-04-26 17:37:49Z jackson $] 


17.1 Overview 


This chapter describes the signals, drivers, and pin assignments of the SC-1000. The pinout includes the
following major collections of signals:

• Clocks and reset
• 3 input and 3 output SiCortex fabric links
• 2 DDR2 channels
• 8-lane PCI Express port with auxiliary bus
• Console port, serial management bus, JTAG, chip tester scan chains, etc.
• Power and ground


17.2 Signal List 


[Grou [Sigal | # [VOT Twe _] 
FabricO Cir  ' 


ll 
ce 
og oo 


ee 
Pmt] —[~s [or 


= ——=1 


2 per 
ss A power 
ol 
ee sea 


pr 20 power 
ss A power 




Description 


Fabric 0 inbound data (port a), high differential 
Fabric 0 inbound data (port a), low differential 
Fabric 0 inbound data (port a) flow control, high 
differential 

Fabric 0 inbound data (port a) flow control, low 
differential 

| f£ ~~ Y|-Fabric 0 outbound data (port x), high differential 
Fabric 0 outbound data (port x), low differential 
Fabric 0 outbound data (port x) flow control, high 
differential 

Fabric 0 outbound data (port x) flow control, low 
differential 

1.2V fabric pad voltage 

Ground 


Fabric 1 ports b (in) and y (out), similar to fl[rt]0_*
1.2V fabric pad voltage 
Ground 
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Description 
Pebric? 
PF tst~—~<iYs«CR tT 8 TC TEC Fabric 2 ports c (in) and z (out), similar to fi[rt|0_* 
- pk 20a power F12¥ fabric pad voltage 
Ground 
[Fabric Miscellaneous [a5] > 
| t~—“‘i‘i xh fl lf OE «Unused fabric transmit lane, high differential 
P t~—<“‘“‘iC xd UT lc TO TEC Unused fabric transmit lane, low differential 
| t~<‘“‘“iC Ach lUTl lf Ud TCs Unused fabric receive lane, high differential 
| ti‘ XLT lc Tl cs Unseed fabric receive lane, low differential 
| ssf ptevdd]6:0) =[ 7 [ A [analog | Fabric quad macro PLL voltage (filtered 2.5V) 
| ti(iwsé‘iéié*d sé pier f6:0] | 7 UT A] analog Fabric quad macro PLL reference return 
Pa ee ee ee Fabric quad macro termination reference resistor 
DDRO ees aE 
| dO_ck_h[2:0] =| 3 | O | sstl1 DDR 0 clock, high differential 
Faocae1pso) —[—3 [0] sss] DDI ek Tow dierent 
d0_dm| : 7 ht Ste DDR 0 data mask 
d0_dqs_. - [8: : DDR 0 data strobe, high differential 
ae Se a DDR 0 data strobe, low differential 
test-mode d0_dq[63:0] 64 sstl1. DDR 0 data 
overrides see 
=a-[ 
test-mode d0_cb|7:0] DDR 0 ecc (alias d1_dq[71:64]) 


overrides see 


sec. 17.3 


a DDR 0 write enable 


[Jats [1-0] tts] D0 column strobe 
es S11, DDR row strobe 

a) ve fe “ DDR 0 chip select ({3:2] NC on PCB}) 
rae) DDR bank addres 


test-mode d0o_ al ~ i 
overrides see 


sec. 17.3 


DDR 0 row and column address 


| sf cee - a DDR 0 clock enable ([3:2] NC on PCB]) 
Sea DDR 0 on-die termination control ([3:2] NC on PCB]) 
—— ities —[P0 DDR O reset 
PY DO_VREF DDR 0 reference voltage 
fs dOrext DDR. 0 termination reference resistor 
PY VDDM 2.5V DDR2 receive pad voltage 
| ss | WDDR a 1.8V DDR2 transmit pad voltage 
piss st Ground 

DDR 1 
ee ae Deva DDR 1 control, similar to d0 

test-mode di_ad DDR 1 row & column address 


overrides see 


sec. 17.3 


test-mode d1_dq[63:0] 64 

overrides see 

sec. 17.3 

test-mode d1_cb/7 

overrides see 

sec. 17.3 
ae 
| 53 


DDR 1 data 


DDR 1 ecc (alias d1_dq[71:64]) 


a a T5V DDRD receive pad voltage 
vor A [-__ power [L8V DDR? transmit pad voltage 
ss 8A power [Ground 
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PCrEpress CB | SC~*Y 
Cis PO ed 
ee 
cco. [8 [pie J 

Ppeaneltro] [8 [| 1 | pee 


pei_ref_clk_h lvds 


pei_ref_clk_l 


—<—= 
— 
cine 
-__F pcicainted | 
fread 
Fcc wren | 
fcc wret | 
a 
Pepsi 
pepsi 
a) 


PvppM___|_35 
vss 3], Balsa 
[Miscellaneous ————S—S—~d~CS 


ee 
Frsyscrartaaed | 
test-mode 


sys_uart_rts_l 
overrides see 


sec. 17.3 


sys_i2c_sda 1 
test-mode 


overrides see 


sec. 17.3 


| A 
| A | 
LO 
Eni 
| O | 
| I | cmos, pullup | 
Fel 
hel 
Oa 
| A | 


cmos, pullup 


cmos, 4mA 


T 
I 
T 
I 


sys_i2c_scl 


ee cs 
es ce 
[J sehctins | 
sche 
ee T 
ita test] 
itaantck | 
TF itaectms | 
Fite tar 
TF itaantdo T 


test-mode test_sdi[99:88] cmos, 8mA, pullup 
overrides see 
sec. 17.3 
test-mode 


overrides see 


12 
sec. 17.3 


__retcnodecen [1 T[anos pailtown | 
Frestseancen [1 T_[emos, pulldown | 
pavscikceh [1 ft[ he _| 
Psyc Pt [es 


test_sdo[99:88] cmos, 8mA, pullup 




Description 


PCI-E transmit data, high differential

PCI-E transmit data, low differential

PCI-E receive data, high differential

PCI-E receive data, low differential

PCI-E 100MHz reference clock output, high differential
(also test_clk_e_h)

PCI-E 100MHz reference clock output, low differential
(also test_clk_e_l)

PCI-E reference clock output buffer reference voltage
PCI-E external reference resistor 

PCI-E module attention LED 

PCI-E module power LED 

PCI-E module power enable 

PCI-E module power good (pullup on PCB) 

PCI-E module power fault (pullup on PCB) 

PCI-E module present (pullup on PCB) 

PCI-E module reset 

2.5V PCI Express pad voltage 

Ground 


serial port transmit data (open drain output) 

serial port receive data 

serial port receiver request-to-send output (open drain
output)

serial port transmitter clear-to-send input 

serial management bus data (open drain output) 


serial management bus clock (open drain output) 


SiCortex test reset (pullup on PCB) 

SiCortex test clock 

SiCortex test mode select 

SiCortex test data in 

SiCortex test data out 

JTAG test reset (pullup on PCB) 

JTAG test clock 

JTAG test mode select 

JTAG test data in 

JTAG test data out 

Chip tester scan chain serial data in (NC on PCB).
These pins either get no boundary-scan insertion or get
observe-only boundary-scan.

Chip tester scan chain serial data out (NC on PCB).
These pins either get no boundary-scan insertion or get
observe-only boundary-scan.

Chip tester test mode select (bus together and pull up
on PCB)

Chip tester test-mode valid (pull down on PCB) 

Chip tester scan enable (pull down on PCB) 

66.7MHz system reference clock, high differential 
66.7MHz system reference clock, low differential 
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[Group [Signal yo [| Type 
ayscelicoh I is 
Fryceiced ets 
Se 2 
Fests ets 
Presta [ts J 
rests ets 
Festi a re 
restciotiet a ets 
-— Frestpcti rs J 
pF restopetict ets 
—— Frestccelich [1 es 
pF restceottel ets 
-— Frestselic a  re 
Fresnel 
oe eee cided 

test_clk_o_l he lees 
raion 
Festive [1A analog 
Fst [A analog 
[Ff avantpiivvdd [1 [Analog 
-—— Fscdipiorn [1A [analog J 
———Feyaspvad [1 anaog J 
TF veciptirin [1a analog 
———Fvecppiivdet [1 aaog J 
-—Fvppitren [1A [analog J 
pT Fvassplievdd [analog] 
-—Fsspirin [1A [analog ——] 
test-mode sys_node[4:0] 5 I cmos 
overrides see 
sec. 17.3 
a 
Foye FT emos- ma J 
Fyne [0 eos, tm J 
cocci [0 [anos a J 
[-——Faveciempp [1A [analog 
-———Frvactempon [1A analog | 
ppt 5 a power 
Description
66.7MHz system reference clock, high differential 
66.7MHz system reference clock, low differential 
DDR 0 test clock, high differential (NC on PCB) 
DDR 0 test clock, low differential (NC on PCB) 
DDR 1 test clock, high differential (NC on PCB) 
DDR 1 test clock, low differential (NC on PCB) 
PCI-E test clock, high differential (NC on PCB)
PCI-E test clock, low differential (NC on PCB)
Processor test clock, high differential (NC on PCB)
Processor test clock, low differential (NC on PCB)
Cache test clock, high differential (NC on PCB)
Cache test clock, low differential (NC on PCB)
Fabric test clock, high differential (NC on PCB) 
Fabric test clock, low differential (NC on PCB) 
Odd-side test clock output for PLL testing, high 
differential 
Odd-side test clock output for PLL testing, low 
differential 
Odd-side test clock output buffer reference voltage 
DDR 0 domain PLL voltage (filtered 2.5V) 
DDR 0 domain PLL reference return 
DDR 1 domain PLL voltage (filtered 2.5V) 
DDR 1 domain PLL reference return 
PCI-E domain PLL voltage (filtered 2.5V) 
PCI-E domain PLL reference return
Processor domain PLL voltage (filtered 2.5V) 
Processor domain PLL reference return 
Fabric domain PLL voltage (filtered 2.5V) 
Fabric domain PLL reference return 
node number (0-26) 


Hard reset (from MSP) 

Status LED (open drain) 

Attention request (to MSP) 

Spare pin (NC on PCB). Internal connection to 
R_ScbGpio register. 

On-chip Logic Analyzer trigger output 
Temperature-sensing diode P terminal 
Temperature-sensing diode N terminal 

2.5V miscellaneous pad voltage 


sss A power [Ground 
[CorePowr——SC~=—“‘~*~*~idS TCT SCS 
CSS 80 power Ground 
-____[vppe [80 [A [power L0V core voltage 
PTOTAL [Total _——*iasef SCS 


17.3. List of Normal-Mode Signals and Their Test-Mode Overrides 


All the following act according to their normal-mode signal names except when {test_mode_en, test_mode[3:0]}
= 5’b1000x or when the SYS TAP instruction register is set to serial-scan mode. In those cases the indicated pins
take on the function indicated by the test signal names. The serial-scan-mode mux connection column indicates the
order in which the individual ATPG scan chains are connected in series to provide the serial-scan-mode function
accessed from the SYS TAP.


Several pins in the DDR PHY also take on a secondary test meaning when {test_mode_en, test_mode[3:0]} =
5’b11000, 5’b11001, or 5’b11010. These test modes are for testing the slave DLLs in the DDR PHY. Each slave
DLL provides 2 test clock outputs; there are 2 slave DLLs for each byte lane. I’ve taken a guess as to which of the
dq pins will be used for this purpose, but the final decision will come from eSilicon when the DDR PHY is nearer
completion.
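
The mode encodings in the two paragraphs above can be summarized in a few lines of code. The following Python sketch is illustrative only: the function name and the returned labels are invented here, and it does not model the SYS TAP serial-scan path, which selects the test signal names independently of the mode pins.

```python
def classify_test_mode(test_mode_en: int, test_mode: int) -> str:
    """Classify the 5-bit value {test_mode_en, test_mode[3:0]}.

    The labels returned here are hypothetical names, not spec terms.
    """
    if test_mode_en not in (0, 1) or not 0 <= test_mode <= 0xF:
        raise ValueError("test_mode_en is 1 bit and test_mode is 4 bits")
    value = (test_mode_en << 4) | test_mode
    if value >> 1 == 0b1000:
        # 5'b1000x: the low bit is a don't-care, so 16 and 17 both match.
        return "scan-override"
    if value in (0b11000, 0b11001, 0b11010):
        # Secondary DDR PHY slave DLL test modes.
        return "ddr-slave-dll-test"
    return "other"


if __name__ == "__main__":
    for en, mode in [(1, 0), (1, 1), (1, 8), (1, 10), (0, 0)]:
        print(en, format(mode, "04b"), classify_test_mode(en, mode))
```

Note that 5’b1000x corresponds to the decimal values 16 and 17.
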


The test_sdi[99:88] & test_sdo[99:88] pins take on a third set of personalities in test mode 19 for parametric
testing of the DDR PHY drive strength and ODT settings and impedance calibration circuit. These are listed in
a separate table below. Because data drive onto these pins must be active during boundary scan (for parametric
measurements), any boundary scan that is inserted on the test_sdi[99:88] & test_sdo[99:88] pins must be observe-only
(or not have B-scan inserted at all).


Normal Signal
(chip level)


d0_dq[12


Test Signal 
(chip level) 


pana] 
[etsdofo | 
[et sat] 
[et sai] | 
[eest-sdofT | 
[eest-sdol] | 
[essai] | 
[iest-sdolsT | 
[essai] | 
[iest-sdofay_| 
[et sais] | 
[estas] 
[et-sdofr | 
[iest-sdol6- | 
[eestsait] | 
[estado] | 
[et sais] | 
[iest-sdofs]__| 
[estsdoloy | 
[estsdoliol | 
et saioy | 
[et saitiol | 
[et satin] | 
[et sai] | 
[esscia] | 
[eestsaitial | 
[rest-scitt5] | 
[estate —_] 
[iest-sdofni] | 
[eestsdofia] | 
[iest-saiti7] | 
== 
[essaiito] | 
[estsaipeol | 
[iest-sdofta] | 
[estsdofial | 
[iestsaien] | 
[est saitza) | 


DDR2 PHY 
IO instance 
pin 


ey 


dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 


CO] Of rey rR] Ww) ml Nailer | oO] ot 
Wy] bo 


dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 


DDR2 PHY 
instance core 
side pin 


test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 




Test-Mode to activate {test_mode_en, test_mode[3:0]} | 2’ndary Test Signal (choice of specific dq’s is a guess for now, to be finalized when eSilicon is ready) | 2’ndary Mode to activate
EC | 
5°d16/17 | 
5°d16/17 | 
5°d16/17 | 
5°d16/17 | 
5°d16/17 | 
5°d16/17 | 
5°d16/17 | 
5°d16/17 | 
5'd16/17 || 
5°d16/17 | 
saci fT COC C*™Y 
Psdi6/17__ |f test-dislav-Otstelk| 5aa2/23_———| 
PSa6/IT || testdlislav OL tet} 523/23 ‘| 
EC | 
saci ff CCdCCCSCSC~S™Y 
EC | 
ea | ea (Sa 
[sdi6/17 || test_dislav-O2-tstelef 5aa2/23_——_—| 
Sai6/17 | test_dlislav02-tstelke| 51d22/23 | 
EC | 
EE | 
EC | 
EC | 
EAC | 
EC | 
eae ——— J» ——__ =| 
EE | 
test_dllslav_03_tstclk 


| 


5 
5 
5 
5 test_dllslav_08_tstclk 


5°d16/17 


test_dllslav_00_tstclk 
test_dllslav_00_tstclk4Z 


5'd22/23 
5§d22/23 
Normal Signal
(chip level) 


d0_dq|71 


Test Signal 
(chip level) 


| test_sdi[23] 


DDR2 PHY 
IO instance 
pin 


dx_dq/71 


DDR2 PHY 
instance core 
side pin 


test_sdi[23] 


d0_dq[67 || test_sdi[24] dx_dq[67 test_sdi[24] 


d0_ad([15 
d0_ad[14 
d0_ad([12 


d0_ad 
d0_ad 
d0_ad 
d0_ad 
d0_ad 
d0_ad 
d0_ad 
d0_ad 
d0_ad 


d0_ad 
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[restsdofi] | 
[estsdotiel | 
[eest-sdoli7] | 
[eestsdofis] | 
[rest-sdoft9] | 
[estsdofzo] | 
[iest-sdofen] | 
[est sdol2a] | 
[iest-sdofea] | 
[estsdoleal | 
[restadof2a1 | 
[estsdolze) | 
[estsdole7] | 
[estsaorsl | 
[rest-saofea] | 
[iest-sdo[so] | 
[et sai] | 
[iestscifzo) | 
[estsdofsi] | 
[iest-sdols2] | 
[eta | 
[estas —_] 
[essai] | 
[etsaiso1__] 
[essai] | 
[iestsaisa] | 
[estas] —_| 
[estciisal | 
[estsdofs3] | 
[iest-sdolsa] | 
[est sais] | 
[restsdolB5] | 
[et saise] | 
[iestsdolsol | 
[iest-sdola7] | 
[estsdolss] | 
[iestsaiia7] | 
[et sass) | 


dx_ad 
dx_ad 
dx_ad 


dx_ad 
dx_ad 
dx_dq 


RY} oOo 


OLN] wl] oul a! oo] ~~ 


dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 


dx_dq|44 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 


Ol 


test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdi 
test_sdi 
test_sdo 
test_sdo 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdo 
test_sdo 
test_sdi 
test_sdo 
test_sdi 
test_sdo 
test_sdo 
test_sdo 
test_sdi[37] 
test_sdi[38] 


| test_sdi[39] test_sdi[39] 


[estsdofso] | 
[restsaio] | 
[estsdotaol | 
[rest-sdotar] | 
[eest-sdolaa] | 
[rest-saiiat] | 
[eest-saitea] | 


OUy OTF OT] OT] OF 
j=} 


test_sdo[39] 
test_sdi[40] 
test_sdo[40] 
test_sdo|41] 
test_sdo|42] 
test_sdi[41] 
test_sdi[42] 




Test-Mode to activate {test_mode_en, test_mode[3:0]} | 2’ndary Test Signal (choice of specific dq’s is a guess for now, to be finalized when eSilicon is ready) | 2’ndary Mode to activate


| 


Psaeir ID id 
Normal Signal
(chip level) 


d0_dq[59] 
d0_dq[62] 


Test Signal 
(chip level) 


ear] _| 
[eessdofas] | 


[etal —_] 
[estsdotaal | 
[restates] —_| 
[rest saitao] —_] 
[estado] | 
[estsdola] | 
[restsaie7] | 
[estsaofar] | 
[esta] | 
[estsaotasl | 
[et sa] —_] 
[etsaiso—_] 
[iestsaotao] | 
[estsdolso1 | 
[essai] | 
[est-sdofsi] | 
[essai] | 
[iest-sdofsa] | 
[estsdotsa] | 
[rest-sdolsal | 
[essai] | 
[estscial | 
[est sais] | 
[estas] 
[et sais7] | 
[rests] | 
[estas] _—_] 
[estciool | 
[estsdofss1 | 
[est sdol56] | 
[et saten | 
[restsa6a] | 
[essai] | 
[eestsai6a] | 


DDR2 PHY 
IO instance 
pin 


DDR2 PHY 
instance core 
side pin 


Test-Mode to activate {test_mode_en, test_mode[3:0]} | 2’ndary Test Signal (choice of specific dq’s is a guess for now, to be finalized when eSilicon is ready) | 2’ndary Mode to activate


SC 
dda{e2]__[reisdoas] [oagi7 | __———=sid|——SS—SC—~* 


dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 


dx_dq 
dx_dq 
dx_dq 
dx_dq 


i) 
Dp 


D 
Ke) 
NO 
j=) 


di_dq[64 | test_sdo[57] dx_dq 




[estsdolss) | 
[essai] | 
[et sait6o) | 
[estscior] | 
[et sates) —_] 


| test_sdo[59 
| test_sdo[60 
| test_sdo|61 
| test_sdo[62 
[ 
[ 


| test_sdo|63 
| test_sdo|64 


dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 


ee Dp 
bo 
w 


jon 
G 
ov) 
Q. 


dx_ad 


test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdo|[11 
test_sdo|[12 
test_sdi[17 
test_sdi 
test_sdi 
test_sdi 
test_sdo[13] 
test_sdo[14 
test_sdi[21 
test_sdi 
test_sdi 
test_sdi 
test_sdo[15] 
test_sdo[16] 
test_sdo[17] 
test_sdo[18] 
test_sdo[19] 
7 test_sdo[20] 


a 
ion) 


—D Ww 
ee} ee 
an 
Ke} 


D 
ey 


I 
j=) 
N 
i 


i) 
x 




a 
or 


an 
ie2) 


ET | 
a | 
| 
a | 
Sai6/17 [fret aay Tiare] RDI 
sai6/i7 _[[test-dlisiav-10-tstele 5022/23 | 
a es es 
a | 
Ee | 
EL | 
EE | 
EC | 
Psai6/ir | reat dar Ta FT| 
Psai6/I7_[[test-dlisiav-t Listed 522/23 
a | 
Ea | 
a | 
| 
sagt |edit IT 


aa! 
Ea 
Ea 
Pear ——SCS—~dSSSSC—S 
Psat [Sid SSC 
Ea 


test_dllslav_13_tstclk] 5‘d22/23 
test_dllslav_13_tstclk# 5‘d22/23 


TC | 
| 
| 
a are Cee 
alg/17 
sai /17 
5'd16/17 
5'd16/17 
5'd16/17 
5'd16/17 


5°d16/17 


5°d16/17 
5°d16/17 


Rev 51328 


SiCortex Confidential 


Normal Signal 
(chip level) 


di_ad[8 
d1_ad|[6 


di_ad 
di_ad 
di_ad 
di_ad 
di_ad 


Test Signal 
(chip level) 


etal _| 
[estsdol6el | 
[estsdolo7] | 
[estsdol6s] | 
[est-sdol6a] | 
[estsdotrol | 
[iest-sdotrt] | 
[eest-sdotra] | 
[iest-sdotra] | 
[eestsdotral | 
[essai] —_| 
[est saitro]_—_] 
[iest-sdotr]_| 
[iestsdotrel | 
[eestsaitrl] —_| 
[rest saitral | 
[eestsaira]—_| 
[iest-scitral | 
[essai] | 
[iestsairrol_—_| 
[eestsaitrm| | 
[estscira] | 
[eestsdotra] | 
[iest-sdotra] | 
[eestsaitra] —_] 
[estsdofra] | 
[et saiso] —_] 
[iestsdofsoq | 
[estsdotsi] | 
[iest-sdols2] | 
[et sail] | 
[estas] | 
[estas] —_] 
[est sdofss1 | 
[estsaisa] | 
[est sdolsal | 
[rest-sdofs5] | 
[estsdols6] | 
[essai] | 
[et saise) | 
[iestsci7] | 


DDR2 PHY 
IO instance 
pin 


ig) 


dx_ad 
dx_ad 
dx_ad 
dx_ad 
dx_ad 
dx_ad 
dx_ad 
dx_ad 
dx_ad 
dx_dq 


dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 


dx_dq 


dx_dq 
dx_dq 


dx_dq 


OU OU] OT] OT] OF 
i=} 


dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
dx_dq 
59 


DD] OU ou QD] D 
WW] CO} NI] O];eR | a> 


dx_dq 


DDR2 PHY 
instance core 
side pin 


test_sdo[21 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdo 
test_sdi 
test_sdi 
test_sdo 
test_sdo 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdi 
test_sdo 
test_sdo 
test_sdi 
test_sdo 
test_sdi 
test_sdo 
test_sdo 
test_sdo 
test_sdi[37] 
test_sdi[38] 
test_sdi[39] 


Piest-sdofso]__[saoni7 | ———S—s SSSSCSC~—S 
Piestsaiiio] __[saon7_ | Sid —SSSC~—S 
Piest-sdolo]__[saroi7 ff ——SSSS—id SSC~SY 
Ptest-sdo[dt] | a16/17 || test dIav IT iweleh aD __—_| 
Ptest-sdo(i2] | sd16/17 |] test-dllsiav-IT-tstclk| 5423/25 | 
Fiesta] [sar P| S| CS 
Piest-saite] [sant | ———SSsid—SSSSCSC~—S 
Prest-sdiis]__[sato7 | ———S—id SSCS 


Test-Mode to activate {test_mode_en, test_mode[3:0]} | 2’ndary Test Signal (choice of specific dq’s is a guess for now, to be finalized when eSilicon is ready) | 2’ndary Mode to activate


di_dq[62 || test_sdo[87] dx_dq[62 test_sdo[43] sdi6/i7 ff} 


Petsdis] i SSCSCSC~dSC“‘CSNC*”i;COOOOOCOC~C*dSCi‘“‘“‘(C™CS™*d*SC OSU Mesten GT] DP 
Peest-sdoso] i id i= test Mtoe 
Peest-saioo] i Sd i —*di test Mtr | 122/23 
Peest-sdofo] i id i *Y test Mato ial] 2228 
Peest-saioa] Si SS S—*d test Mtoe] 922/23 
Peest-sdofos] Pi Sd Si testator il] 9182228 




Normal Signal (chip level) | Test Signal (chip level) | DDR2 PHY IO instance pin | DDR2 PHY instance core side pin | Test-Mode to activate {test_mode_en, test_mode[3:0]} | 2’ndary Test Signal (choice of specific dq’s is a guess for now, to be finalized when eSilicon is ready) | 2’ndary Mode to activate
test_dll_MasterAdj{1] 


test_diL MasterAdj{0] 


P| testa StaveO al 
test_sdi[98 
test_sdi[99 
test_sdi[97 
test 
rst Stave ac) 023/25 | 
iestsdo(2 a Ce 
rest aster Ait] 9822/23 

2 

1 

0 


[9] 
iest.sdo(00 estat stave air] 9032/23 
rest ai Stave ai] e223 
testo Presta Stave Aor] 9822/23 
Tf rescaitisterrisy — e2223 | 
 restcatisteria] [e223 
rest aitistciis) [2223 

syannodell a | 
syenodefo] | i i Cid escaistertt] Pree 
Paveiaesda PS tect) 28 
FS 


More test-mode overrides for the standard I/O block on the West (odd) end of the North (pci) side: 


Normal Signal | Test Signal | Test-Mode to activate {test_mode_en, test_mode[3:0]} | Test Signal | Test Mode to activate


f 
Faia] | apap [sao ‘| ——S~dCSCS~S 
Feest-sdofso]___|[ dima] [sa ‘i ———s«d—S—S 
test-[90 ddp-dviv-impedi] [sao [SS 
Peest-sdoloi]___|[diimpm] fsa +i ——S~dSCSCS~S 
test ddp-dviv-imped] [sao || ———SSS—Sid 
Feest-sdolos]___|[d0impoi] [sai sd SS 
test add CS | 
Ftest-sdol95]___|[ d0.impvp] [sa ————=~«d——S—s 
testi 96 ddp-tems0) [sao ——SSSS—S~i 
Feest-sdolo7] || a0.impnl] [sai ———~+d——S—s 
Peest-sdis] || ddp.tormis0___—_—p said SSSSSCSCSCSSSSSCSC~—S 
test-sdo(9o] || dd-mp-nis]_ [sai ———SSsidSOSCSCS 
Pets] iE SCSC~iSCSCSC~dYS OSC SCS 
test-sdo(os] || imps a——CidCSSC~‘“‘“‘;<S] :SSSSC*d 
| 
testdo[96 imp alo a 




Normal Signal | Test Signal | Test-Mode to activate {test_mode_en, test_mode[3:0]} | Test Signal | Test Mode to activate


test_sdi[95


dl_imp_p[3] (tie OE | 5’d19 
on 


[compe ____[ sai 


1 


on 


_imp_p 


2 


[amp ___[ a9 


1 


_imp_p 


_imp_n 


_imp_p 


1 


(tie OE | 5’d19 
(tie OE | 5’d19 


ft 
I 
a | ee 
aisles iS Pel 
EE | 
| a Coes 
| ee 
ee 
a (es 
| Scene (eee 
| te — 
| ee 




Chapter 18 


Programming Considerations 


[Last modified $Id: pguide.lyx 42289 2007-07-24 15:55:03Z wsnyder $] 


18.1 Overview 


The rest of this document is pretty detailed. While you could probably find all you need to know in the spec, 
we’ve attempted to get all the peculiarities relating to programming the chip right here. In all cases, the procedures 
and rules outlined here are meant as programmer’s hints. 


18.2 Memory Transactions and Ordering 


18.2.1 The Sync Instruction 
18.2.2 I-Stream vs. D-Stream Accesses 


18.2.3 I/O ordering 


I/O writes from a single CPU are processed in strict order within the memory system, but once the writes leave 
the memory system, there is no longer any guarantee of ordering. For example, a write to an SCB register may not 
complete (take effect) before a write to a subsequent DMA engine register. To enforce ordering in situations like 
this, do an I/O read to the SCB register before doing the DMA engine register write (sync is not required). 
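
The read-back rule can be sketched as a small helper. This is a minimal illustration in Python, not driver code: RecordingBus is a hypothetical stand-in that only logs the order of accesses, and the addresses used with it are placeholders rather than real register addresses.

```python
class RecordingBus:
    """Hypothetical stand-in for I/O access; it only records operation order."""

    def __init__(self):
        self.log = []
        self.regs = {}

    def write(self, addr, value):
        self.log.append(("write", addr))
        self.regs[addr] = value

    def read(self, addr):
        self.log.append(("read", addr))
        return self.regs.get(addr, 0)


def write_scb_then_dma(bus, scb_addr, scb_value, dma_addr, dma_value):
    """Write an SCB register, then a DMA engine register, in enforced order."""
    bus.write(scb_addr, scb_value)
    # The I/O read of the SCB register is the ordering point described in
    # the text: the DMA write is issued only after the read has returned.
    bus.read(scb_addr)
    bus.write(dma_addr, dma_value)


if __name__ == "__main__":
    bus = RecordingBus()
    write_scb_then_dma(bus, 0x100, 1, 0x200, 2)  # placeholder addresses
    print(bus.log)
```
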

When sending SPCL operations to the DMA engine, you must issue a SYNC instruction between every pair of 
SPCLs, or some SPCLs may be lost in the L2 cache. 

The DMA engine has a bug (#1991) which can cause RDIOs to return corrupted data when followed immediately 
by a WTIO from the same CPU. I/O accesses from different CPUs are not affected, and SPCLs are not affected. 
When it happens, the WTIO overwrites the data before it can be sent back to the core, so the RDIO incorrectly 
returns the data from the WTIO. To avoid this, either issue a SYNC instruction between the RDIO and WTIO, 
or be sure to use the RDIO result before issuing the WTIO. All DMA addresses are affected (RA_DmaImem,
RA_DmaDmem, RA_DmaApplface0,1, etc.) except for those in the SCB range (RA_SDma*). The bug has only 
been observed when DMA is in the process of doing lots of block writes and the CSW is heavily loaded. 
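
Both SYNC rules above (between every pair of SPCLs, and between an RDIO and an immediately following WTIO for bug #1991) can be applied mechanically to a planned sequence of operations. The sketch below is a hedged illustration: the tuple representation, the function name, and the `affected` predicate (meant to exclude the SCB range) are assumptions made for this example. It implements only the SYNC workaround, not the alternative of consuming the RDIO result first.

```python
def insert_syncs(ops, affected=lambda addr: True):
    """Return a copy of ops with SYNC entries added where the text requires.

    ops is a list of (kind, addr) tuples with kind in {"SPCL", "RDIO", "WTIO"}.
    affected(addr) should return False for addresses in the SCB range,
    which the text says are not subject to bug #1991.
    """
    out = []
    for op in ops:
        if out:
            prev_kind, prev_addr = out[-1]
            spcl_pair = prev_kind == "SPCL" and op[0] == "SPCL"
            rdio_then_wtio = (
                prev_kind == "RDIO" and op[0] == "WTIO" and affected(prev_addr)
            )
            if spcl_pair or rdio_then_wtio:
                out.append(("SYNC", None))
        out.append(op)
    return out


if __name__ == "__main__":
    plan = [("SPCL", 1), ("SPCL", 2), ("RDIO", 3), ("WTIO", 4)]
    for step in insert_syncs(plan):
        print(step)
```
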


18.2.4 D-Stream vs. I/O Operations and Interrupt Delivery 
18.2.4.1 I/O read / Block Write interaction 


I/O reads can have a hardware interaction with DMA (or PCI) block writes which has a substantial performance 
impact. If CPU X is doing an I/O read to some device that’s really far away, a DMA or PCI BWT to a cache 
line A which is owned as D-stream by that processor will not complete until the I/O read completes, regardless of 
whether CPU X has any intention to use line A. Since the DMA engine writes out received packets using BWTs, 
this can have a meaningful performance impact on DMA latency. 

Software which wishes to use the DMA engine in a high-performance manner can prevent this unhappy circumstance
by mapping its DMA receive buffers to physical pages which are not present in the caches of any processor
which does I/O reads to far away places. Note that I/O reads to cache-local addresses (e.g. interrupt registers)
will never have this interaction, nor will I/O writes of any kind.

The hardware reason that this case occurs is that both I/O reads and BWTs that hit in a local L2 cache require 
exclusive use of that L2 cache’s “might receive data soon” resource, and if the I/O read gets it first, the BWT might 
have to wait a while. 


18.2.5 Oddball Address Spaces and Physical Addressing 
18.2.6 Error Traps 

18.2.7 Interrupts and Interrupt Handling 

18.2.8 Address Aliasing 


Processor segment local control registers (RA_CacLoc registers) are assigned addresses in the range 0xE9E000000
to 0xE9E001000. Addresses in the range 0xE9E000000 to 0xE9E000FFF may be decoded such that bits 11 and 10
are ignored. This means that addresses alias in this region such that 0xE9E000Cxx, 0xE9E0008xx, 0xE9E0004xx,
and 0xE9E0000xx all address the same register. Similarly, addresses {0xE9E000Dxx, 0xE9E0009xx, 0xE9E0005xx,
and 0xE9E0001xx} and {0xE9E000Exx, 0xE9E000Axx, 0xE9E0006xx, and 0xE9E0002xx} and {0xE9E000Fxx,
0xE9E000Bxx, 0xE9E0007xx, and 0xE9E0003xx} form sets of aliased addresses.
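
The aliasing rule amounts to masking out bits 11 and 10 of the address. The following Python sketch shows that arithmetic; the function and constant names are invented for this example, and it assumes the decode really does ignore those two bits (the text only says it may).

```python
CACLOC_FIRST = 0xE9E000000
CACLOC_LAST = 0xE9E000FFF
ALIAS_BITS = (1 << 11) | (1 << 10)  # the two address bits that may be ignored


def canonical_cacloc(addr):
    """Map an address in the aliased region to its lowest alias."""
    if not CACLOC_FIRST <= addr <= CACLOC_LAST:
        raise ValueError("address is outside the aliased RA_CacLoc region")
    return addr & ~ALIAS_BITS


def cacloc_aliases(addr):
    """List all four addresses that decode to the same register."""
    base = canonical_cacloc(addr)
    return [base | (i << 10) for i in range(4)]


if __name__ == "__main__":
    print([hex(a) for a in cacloc_aliases(0xE9E000D10)])
```
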


18.3 The DRAM Controllers


18.3.1 Initial Calibration and Setup 


One of the steps involved in DDR calibration involves forcing a write or read to address X to go to DDR (and
not get caught in a cache). For the L2, this is done by previously reading two other addresses Y and Z which are
known to collide with X. The subtle part is that a sync is required after the two setup reads, because part of the
job of the reads of Y and Z is to flush X from the L1. Since the CPU processes hits under misses, if Y or Z is a miss
and X would have been a hit, we need to sync to make sure Y and Z have evicted X from the L2 before moving on
to read it.
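
The eviction sequence can be illustrated with a toy model. The sketch below assumes a single two-way cache set with LRU replacement, which is an assumption made for illustration and not a description of the ICE9 L2; the class and function names are likewise invented. In this sequential model the sync callback has no effect on the outcome; it only marks where the real sync instruction belongs.

```python
from collections import OrderedDict


class ToySet:
    """One cache set with LRU replacement (a toy model, not the ICE9 L2)."""

    def __init__(self, ways=2):
        self.ways = ways
        self.lines = OrderedDict()

    def read(self, tag):
        """Return True on a hit; on a miss, fill and evict the LRU line."""
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)
        else:
            if len(self.lines) == self.ways:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[tag] = True
        return hit


def force_access_to_dram(cache_set, x, y, z, sync):
    """Read colliding addresses Y and Z, sync, then access X."""
    cache_set.read(y)
    cache_set.read(z)
    sync()  # on hardware: wait until the Y and Z reads have completed
    return cache_set.read(x)  # False means X missed and went to memory


if __name__ == "__main__":
    s = ToySet()
    s.read("X")  # X starts out cached
    print(force_access_to_dram(s, "X", "Y", "Z", lambda: None))
```
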


18.3.2 On-the-fly ReCalibration 
18.3.2.1 Software filtering of impedance calibration settings 


The drive & ODT calibration settings for the DDR I/O cells come from the PDDR2CAL cell. This uses a 
precision resistor on the board to calibrate out process, temperature and voltage variation effects for precise output 
drive strength (output impedance) and on die impedance termination (ODT). Because the calibration may produce 
spurious results, hardware is provided to allow for software filtering of the calibration settings before they are 
applied to the I/O cells. 

Here are some potentially important things to know in designing the software filtering algorithm.

(These are pasted from email; the formatting isn’t pretty, but then, this is Lyx.) 


1. Are IMP_P[3:0] & IMP_N[3:0] reset by CAL_RESET? If so, what values do they 
take on at reset? 


Answer: yes, they are reset, and the values in the SuperPhy are the same as 
the SS values in the email below. 
IMP_P[3:0]= 4’b1100 
IMP_N[3:0]= 4’b1001 
The reason is that when you power up, the CSN signal going to the DIMM from 
the ASIC should be an immediate ’1’, so the SSTL18 buffers must have 
sufficient drive strength under all PVT conditions. This also implies that 
the DRIVE[] values from the core to the SuperPhy for CSN (and CLK) must also 
have an appropriate value as well: 
cti_clk_driv_imped[] <-------------
cti_addr_driv_imped[]
cti_ctrl_driv_imped[] <-------------
cti_dqs0_driv_imped[]
cti_dqs1_driv_imped[]
cti_dqs2_driv_imped[]
cti_dqs3_driv_imped[]
cti_dqs4_driv_imped[]
cti_dqs5_driv_imped[]
cti_dqs6_driv_imped[]
cti_dqs7_driv_imped[]
cti_dqs8_driv_imped[]
cti_dq_bl0_driv_imped[]
cti_dq_bl1_driv_imped[]
cti_dq_bl2_driv_imped[]
cti_dq_bl3_driv_imped[]
cti_dq_bl4_driv_imped[]
cti_dq_bl5_driv_imped[]
cti_dq_bl6_driv_imped[]
cti_dq_bl7_driv_imped[]
cti_dq_bl8_driv_imped[]
Finally, the ODT in the ASIC should be turned off, which it will be due to 
the resetn effect on, for example, ddo_dqs_roe[0]. Note: clk and CSN have 
these signals permanently turned off in the SuperPhy. 


2. For software filtering of IMP_P[3:0] & IMP_N[3:0]: 
- what is the counting sequence as settings cause decreasing impedance? 


Answer: 


For N: 9 is slow, 5 is typ, 3 is fast PVT. 

So, if you have a single part sitting on the bench, operating with some 
fixed voltage, temp, and process, all unchanging, then increasing 
IMP_N[3:0] will decrease the output impedance. 


For P: 12 is slow, 7 is typ, 4 is fast PVT.
So, increasing IMP_P[3:0] will decrease the output impedance.


- what are the expected nominal (i.e., TT process, 1.0V, 25C) values? 
Answer: N= 5, P= 7. 


- how much should we expect to see the values change with voltage &
temperature, i.e., sensitivities in LSBs /mV & /degree-C?


Answer: would have to do another HSpice sim to find this. But, based on 
the PVT factors of [1.321, 1.185, 1.101], then a coarse answer would be: 


N: (9/5 - 1)* (0.185 / 0.72349)= 20.46%
i.e. 20.46% change for 100mV delta, or 0.2046% change for 1mV delta.
==> 0.2046% * 5 = 0.01023 numeric change / mV.
==> "1mV delta" will require changing the setting from:
5 to 5.01023.


P: (12/7 - 1)* (0.185 / 0.72349)= 18.26%
i.e. 18.26% change for 100mV delta, or 0.1826% change for 1mV delta.
==> 0.1826% * 7 = 0.012782 numeric change / mV.
==> "1mV delta" will require changing the setting from:
7 to 7.012782.


N: (9/5 - 1)* (0.101 / 0.72349)= 11.168% 
i.e. 11.168% change for 100C delta, or 0.11168% change for 1C delta. 
==> 0.11168% * 5 = 0.005584 numeric change / C. 
==> "1C delta" will require changing the setting from: 
5 to 5.005584 


P: (12/7 - 1)* (0.101 / 0.72349)= 9.9715% 
i.e. 9.9715% change for 100C delta, or 0.099715% change for 1C delta. 
==> 0.099715% * 7 = 0.00698 numeric change / C. 
==> "1C delta" will require changing the setting from: 
7 to 7.00698 
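
The coarse sensitivity arithmetic in the answer above can be reproduced directly. The Python sketch below uses only the numbers quoted in the email; the function name is invented here. The divisor 0.72349 appears to be the product of the three PVT factors minus one (1.321 x 1.185 x 1.101 is approximately 1.72349), although the email does not say so.

```python
def coarse_sensitivity(slow, typ, pvt_term, divisor=0.72349):
    """Reproduce the email's coarse estimate.

    slow, typ : calibration settings at the slow and typical corners
    pvt_term  : PVT factor minus one (0.185 for voltage, 0.101 for temperature)
    Returns (fractional change per 100 mV or 100 C,
             setting change in LSBs per 1 mV or 1 C).
    """
    change_per_100 = (slow / typ - 1) * (pvt_term / divisor)
    lsb_per_unit = change_per_100 / 100 * typ
    return change_per_100, lsb_per_unit


if __name__ == "__main__":
    cases = [
        ("N, voltage", 9, 5, 0.185),
        ("P, voltage", 12, 7, 0.185),
        ("N, temperature", 9, 5, 0.101),
        ("P, temperature", 12, 7, 0.101),
    ]
    for name, slow, typ, term in cases:
        frac, lsb = coarse_sensitivity(slow, typ, term)
        print(f"{name}: {frac * 100:.3f}% per 100 units, {lsb:.6f} LSB per unit")
```

Small differences in the last digit relative to the email come from the email rounding the percentage before multiplying by the nominal setting.
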
18.3.3 DDR Impedance Calibration and Bug 2013


See Section 8.4.8.36 for a discussion of the different auto calibration modes. Note that CalMode 2 is not 
currently supported. If any memory transaction is in flight at the time an autocalibration in mode 2 is initiated, 
the autocal state machine will hang and prevent completion of the calibration loop and thus completion of the 
memory reference. 


18.4 Initializing the PMI/PCI Controller 


18.4.1 Unused PCI Controllers 
18.4.2 PCI Controllers With Connected Devices 
18.4.3 PCI Controllers With No Connected Device
Chapter 19


Differences, Bugs, and Enhancements


19.1 Overview 


This chapter summarizes the product differences and errata for the different SiCortex chips. See the corre- 
sponding chapters for more information. 


19.2 User Code 


19.2.1 Product and Chip Pass Differences 


1. ICE9B fixes bug2619 whereby ICE9A requires double load-linkeds to ensure atomicity. This also removes
the rationale for the suggestion in bug2807 that R_CpuConfig_LLTIME be programmed to 1 or greater to
allow enough time for most atomic sequences to complete; _LLTIME may now be programmed to zero.


2. ICE9B1 fixes bug2826 whereby Multiply Double and friends may get incorrect results when not followed
by an idle cycle, or after write-after-write stalls. This afflicted madd.d, msub.d, mul.d, nmadd.d, nmsub.d,
recip.d, rsqrt.d, and sqrt.d.


3. NEED IMPL: TWC9A adds more CPU cores, for a total of 10.


4. TWC9A uses a new core, IceT. This is described in a different document.


19.2.2 Known Bugs and Possible Enhancements


1. None.


19.3 Processor Core


19.3.1 Product and Chip Pass Differences 


1. ICE9B returns a different product (ICE9B) when reading R_CpuPRId and R_CpuTapIDCODE.


2. ICE9B fixes bug1965 whereby R_CpuErrCtl reads swap bits 31 and 28. In ICE9A any read-modify-writes
need to swap these bits before writing them back.


3. ICE9B improves micro DTLB performance (bug2200) by using an entry size of 64KB when the corresponding
TLB entry is 64KB or larger. If the TLB entry is 16KB, the old 4KB uTLB entry size is used.


4. ICE9B improves probe performance by using 64 byte probes, see bug2202.


5. ICE9B removes an unnecessary synchronizer on the cac_cpu_int wires; this reduces interrupt latency by one
pclk.


6. ICE9B adds performance counter events for L2 misses and floating point operations, and allows all events
to be visible to both counter 0 and counter 1.


7. TWC9A returns a different product (TWC9A) when reading R_CpuPRId and R_CpuTapIDCODE.


8. TWC9A uses a new core, IceT. This is described in a different document.
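
The bug1965 workaround in item 2 comes down to exchanging two bit positions. The Python sketch below shows the bit manipulation only. It assumes a 32-bit register value and that only reads (not writes) present bits 31 and 28 swapped; the function names and mask parameters are invented for this example.

```python
BIT31 = 1 << 31
BIT28 = 1 << 28


def swap_bits_31_28(value):
    """Exchange bits 31 and 28 of a 32-bit value."""
    value &= 0xFFFFFFFF
    if bool(value & BIT31) != bool(value & BIT28):
        value ^= BIT31 | BIT28  # the bits differ, so flipping both swaps them
    return value


def errctl_rmw_ice9a(raw_read, set_mask=0, clear_mask=0):
    """Compute the value to write back after an ICE9A read of R_CpuErrCtl.

    raw_read is the value as read (bits 31 and 28 swapped by the bug).
    The swap is undone first, then the requested bits are cleared and set.
    """
    corrected = swap_bits_31_28(raw_read)
    return ((corrected & ~clear_mask) | set_mask) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(errctl_rmw_ice9a(0x80000000, set_mask=0x1)))
```

On ICE9B, where the bug is fixed, the swap step would have to be skipped.
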


19.3.2 Known Bugs and Possible Enhancements (M5KF only) 


1. On D-Cache ECC errors, R_CpuCacheErr_EW may record the incorrect way number and index, see
bug1575. As a workaround, software should flush the entire cache on ECC errors. 


2. On filling the TLB with a 4KB page, we should pull a machine check, as 4KB pages are not supported. 
3. On writes to accelerated space, we should pull a machine check, as they are not supported. 


4. We should add a 64-bit cycle counter which is NOT writable, as the current count register is occasionally 
overwritten by the kernel, bug3342. 


5. We should implement the RDHWR instruction so user space code can see the cycle counter and processor 
number. 


6. We should add more VA bits, to enable the VA to be unique across the entire system. 


19.4 Addressing 


19.4.1 Product and Chip Pass Differences 


1. TWC9A adds some values to the AddrBusStop enumeration to support the additional cores, bug3377.


19.5 L2 Cache 


19.5.1 Product and Chip Pass Differences 


1. TWC9A’s L2 cache is part of the new IceT core, and is described in a different document. 


2. TWC9A adds the CswStopNumTwe and CswTidTwe enumerations to support more cores, and more TIDs
per core, bug3377.


3. NEED IMPL: TWC9A fixes the R_CacxIntCr[#]_Overflow bit being mis-cleared when clearing R_CacxIntCr[#]_Active,
bug3165.


4. NEED IMPL: The R_CohxEccMode_CorEna bit must be set whenever the ICE9 caches are active, bug1990.


5. NEED IMPL: TWC9A pushes IO writes instead of using a special command, bug4898.


6. NEED IMPL: TWC9A removes SPCL in favor of IO writes, bug4899.


7. NEED IMPL: TWC9A stalls issuing probes to avoid large per-cpu probe queues.


19.5.2 Known Bugs and Possible Enhancements 
19.6 Memory Controller 


19.6.1 Product and Chip Pass Differences 


1. ICE9B fixes the DDR unit to support IO driver calibration before the DRAM initialization sequence, bug2276.
In ICE9A the Ddr/Ddp units currently only support updating values into the IMP_P_HV[3:0] and IMP_N_HV[3:0]
inputs of the DDR2 IO cells during one of the mission mode time CalModes. When SoftReset is asserted the
PHY puts default strong values (low impedance biased) into these.


2. ICE9B fixes some of the ODT on/off range values, bug2401. The NWL controller was supposed to support
the following range of ODT turn on/off times for Ice9a’s DDR-Phy. ON time range, controlled by
DdrxPhyCfg2_AsicDqsOdtOn and DdrxPhyCfg2_AsicDqOdtOn: -2.5 clocks <-> 0 clocks (in half cycle increments)
relative to the start of the read preamble. OFF time range, controlled by DdrxPhyCfg2_AsicDqsOdtOff and
DdrxPhyCfg2_AsicDqOdtOff: -1.5 clocks <-> 2 clocks (in half cycle increments) relative to the start of the
read preamble. However, the bug causes the -2.5 and -2 clocks turn on times to NOT work with turn off
times of 1.5 and 2 clocks.


3. TWC9A fixes access to any SCB bus slave hanging while the DDR controller is in reset, bug2928.


4. NEED IMPL: TWC9A drops support for unbuffered DIMMs.
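
The ODT timing restriction from bug2401 (item 2 above) can be written as a small validity check. This Python sketch is an illustration under stated assumptions: the function names are invented, times are expressed in clocks relative to the start of the read preamble, and the mapping from these values to the DdrxPhyCfg2 field encodings is not modeled.

```python
ICE9A_BAD_ON = (-2.5, -2.0)   # turn-on times affected by bug2401
ICE9A_BAD_OFF = (1.5, 2.0)    # turn-off times they fail to work with


def _in_half_cycle_range(value, low, high):
    """True if value lies in [low, high] on a half-cycle grid."""
    return low <= value <= high and float(value * 2).is_integer()


def odt_times_ok_ice9a(on_clocks, off_clocks):
    """Return False for the on/off combinations that bug2401 breaks on ICE9A."""
    if not _in_half_cycle_range(on_clocks, -2.5, 0.0):
        raise ValueError("ODT on time must be -2.5..0 clocks in 0.5 steps")
    if not _in_half_cycle_range(off_clocks, -1.5, 2.0):
        raise ValueError("ODT off time must be -1.5..2 clocks in 0.5 steps")
    return not (on_clocks in ICE9A_BAD_ON and off_clocks in ICE9A_BAD_OFF)


if __name__ == "__main__":
    for on, off in [(-2.5, 2.0), (-2.0, 1.5), (-1.5, 2.0), (-2.5, 1.0)]:
        print(on, off, odt_times_ok_ice9a(on, off))
```
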


19.6.2 Known Bugs and Possible Enhancements 


1. Calibration Mode 2 can cause Ddi to hang waiting for Powerdown, see bug2013. When setting AutoCalUpdate
in cal mode 2 (update during prechargePowerdown), the Ddi can hang. This is caused when a request is at the
head of the queue requesting to be sent to the controller at the time we start the calibration update process.
The calibration logic spins in place waiting for powerdown entry. However, this pending request causes the
powerdown counter to be cleared on every cycle, which blocks the Ddr from ever entering powerdown mode.
To work around this, do not use calibration mode 2.


2. The DDR bank address could be changed to better optimize page hits, bug2068.


19.7 PCI 


19.7.1 Product and Chip Pass Differences 


1. ICE9B fixes legacy interrupt D behavior incorrect during a link down, bug1984. In ICE9A if an
ASSERT_INTD message arrives from the endpoint, software will service the interrupt. During this time, if
the link goes down, an implicit DEASSERT_INTD should occur, but this did not happen. So if the interrupt
service routine ends with a “wait for DEASSERT_INTD”, it is possible that it will hang forever.


2. ICE9B fixes ecc error ignored when CLEAR comes at the same time, bug2028. In ICE9A, suppose an ECC
error is in effect and the interrupt is raised. If software later clears the interrupt and a second ECC error
arrives at the same time (in PMI where it checks, or does not check, for ecc error and clear), PMI ignores the
second ECC error.


3. ICE9B fixes the MsiBaseAddr register addressing, bug2097. In ICE9A, software has to program the PMI
MsiBaseAddr register with an Ice9 address converted into a PCIe space address (look at the address mapping
in the hardware spec).


4. ICE9B fixes RX detection not being completed when some lanes are disabled, bug2113. In ICE9A, when one
or more lanes of a multi-lane link are disabled using TxCompliance/TxElecIdle as described in Section 8 of
the PIPE specification, initiating a receiver detection sequence will cause the PCS layer to hang due to the
“turned off” lanes not performing the receiver detection operation. To workaround, enable all lanes prior to
performing a receiver detection operation, as lanes which are turned off will not participate in the receiver
detection sequence.


5. NEED IMPL: TWC9A fixes only the bottom 16 bits being writable in R_PmiVmReqDat, bug2760. We
couldn’t find any PCIe vendor which uses vendor messages, so this is of only minor concern.
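

This section does not give a software workaround for bug1984. One defensive pattern, sketched below in
Python purely as an assumption, is to bound the “wait for DEASSERT_INTD” loop and to treat a link-down
condition as the implicit deassert that ICE9A fails to generate. The two callbacks and the poll limit are
hypothetical stand-ins for whatever status reads the real driver has available.

```python
def wait_for_deassert(intd_asserted, link_up, max_polls=1000):
    """Poll for DEASSERT_INTD without hanging if the link drops.

    intd_asserted and link_up are caller-supplied callables returning
    bool; both are hypothetical hooks, not documented ICE9 interfaces.
    """
    for _ in range(max_polls):
        if not link_up():
            return "link_down"      # treat as the implicit DEASSERT_INTD
        if not intd_asserted():
            return "deasserted"
    return "timeout"                # give up rather than spin forever


# Example: the interrupt deasserts on the third poll while the link stays up.
states = iter([True, True, False])
print(wait_for_deassert(lambda: next(states), lambda: True))
```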


19.7.2 Known Bugs and Possible Enhancements 


1. None.


19.8 DMA


19.8.1 Product and Chip Pass Differences 
1. NEED IMPL: TWC9A records the address and syndrome of DRAM ECC errors, bug2157. 


2. NEED IMPL: TWC9A fixes generation of bad ECC when ECC correction disabled and a 32-bit aligned
packet is read, bug2396. R_SdmaEccMode bit 6 (CifCorrEna) enables ECC correction in CIF. This logic is 
only needed when the microengine does a BRD from a memory address with bit 2 set (32-bit realignment). 
When CifCorrEna is off and the microengine does a BRD from a memory address with bit 2 set, the ECC 
written into the DMA’s internal memory (TX or COPY port packet buffer) is incorrectly forced to zero. Data 
with corrupted ECC may reach the FSW or main memory when the packet is sent. To workaround, leave 
CifCorrEna always set. 


3. NEED IMPL: TWC9A fixes non-correction of ECC during 32-bit realignment operations, bug2403. When
the CifCorrEna bit is on, and DMA is doing a read with 32-bit realignment, and there is a single bit error
on the data from the CSW, the RTL does not correct the error. The RTL corrects the error inside the
DmaCifDatacalg modules, but then incorrectly puts out the uncorrected data on cif_xxx_Data*[63:0] and into
the next DmaCifDatacalg module. But the ECC bits on cif_xxx_data*[71:64] are the ECC consistent with
the corrected data, so the resulting data appears to have just a single bit error. Workaround: None needed, 
as the error will be corrected at the destination of the DMA engine. 


4. NEED IMPL: TWC9A might double the size of the instruction memory, bug3390.
5. NEED IMPL: TWC9A removes SPCL in favor of IO writes, bug4899.
6. NEED IMPL: TWC9A removes 32 byte writes to support DDR x4 parts, bug4793.


7. MIGHTFIX: TWC9A might fix a performance issue which requires a dead cycle between DMA packets headed 
into the FSW, bug597. 


8. MIGHTFIX: TWC9A might fix DmaCif RDIO being corrupted by subsequent WTIO from the same core, 
bug1991. This can cause RDIOs to return corrupted data when followed immediately by a WTIO from the 
same CPU. I/O accesses from different CPUs are not affected, and SPCLs are not affected. When it happens, 
the WTIO overwrites the data before it can be sent back to the core, so the RDIO incorrectly returns the 
data from the WTIO. To avoid this, either issue a SYNC instruction between the RDIO and WTIO, or 
be sure to use the RDIO result before issuing the WTIO. All DMA addresses are affected (RA_DmaImem,
RA_DmaDmem, RA_DmaAppIface0,1, etc.) except for those in the SCB range (RA_SDma*). The bug has
only been observed when DMA is in the process of doing lots of block writes and the CSW is heavily loaded. 


9. MIGHTFIX: Various possible microinstruction enhancements, bug3392, bug3393, bug3394, bug3395, bug3396. 
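

The bug2396 workaround (leave CifCorrEna always set) amounts to a read-modify-write of R_SdmaEccMode that
never clears bit 6. A minimal Python sketch follows; the bit position comes from the text above, while the
read_reg/write_reg accessors and the helper name are hypothetical placeholders for the real SCB access routines.

```python
CIF_CORR_ENA = 1 << 6   # R_SdmaEccMode bit 6 (CifCorrEna), per the bug2396 description


def keep_cif_corr_ena(read_reg, write_reg):
    """Ensure CifCorrEna is set, leaving the other mode bits untouched.

    read_reg and write_reg are hypothetical accessors for R_SdmaEccMode.
    Returns the register value after the update.
    """
    value = read_reg()
    if not value & CIF_CORR_ENA:
        write_reg(value | CIF_CORR_ENA)
    return read_reg()


# Example with a dict standing in for the register.
fake = {"R_SdmaEccMode": 0x03}
result = keep_cif_corr_ena(lambda: fake["R_SdmaEccMode"],
                           lambda v: fake.__setitem__("R_SdmaEccMode", v))
print(hex(result))
```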


19.8.2 Known Bugs and Possible Enhancements 
19.9 Fabric Links 


19.9.1 Product and Chip Pass Differences 
1. NEED IMPL: TWC9A fixes certain noise patterns from causing fabric deadlocks, bug2132. 


2. NEED IMPL: All FL internal counters’ increment signals should be wired into the SCB counters, bug3488. 


19.9.2 Known Bugs and Possible Enhancements 


1. Force retraining should always complete, and software shouldn’t have to detect and implement retries. 


2. The out-of-band path was never used by software, and could be removed for simplicity if desired. 
19.10 Fabric Switch
19.10.1 Product and Chip Pass Differences


1. None.


19.10.2 Known Bugs and Possible Enhancements 


1. The FSW has an architectural performance limit preventing 4 ford packets at max rate, bug1832.


19.11 SCB 


19.11.1 Product and Chip Pass Differences 


1. ICE9B returns a different product (ICE9B) and/or revision (ICE9A1 vs ICE9A0) when reading R_ScbChipRev.
2. ICE9B has reduced latency accessing the SCB’s own registers.
3. ICE9B adds an interrupt/attention for when the Chip<->Msp channel is ready for transmit.
4. ICE9B adds R_ScbDInt to replace the SysChain R_SysTapDint register, see bug2223.
5. TWC9A returns a different product (TWC9A) and/or revision when reading R_ScbChipRev.
6. NEED IMPL: TWC9A supports 64 bit SCB slaves and 64 bit registers, see bug4619.


7. TWC9A adds R_ScbDInt_SendDInt6, R_ScbDInt_Cpu6DM, R_ScbAtnInt_Cpu6DM Mask, and R_ScbAtnInt_Cpu6DM
to support CPUs 6-9.


8. TWC9A fixes reads to fast DDR clock registers returning the wrong results after a CCLK register read,
bug4331. Earlier chips required a dummy read between such read sequences.


9. TWC9A will skip sampling bucket pairs where R_ScbPerfBuckets_Event == AllEvent_INVALID. This is
backward compatible with other products, which should use that encoding for invalid buckets, bug4265.


19.11.2 Known Bugs and Possible Enhancements 


1. In ICE9A and ICE9B, all SCB accesses must be done with 32-bit accesses. Using a 64-bit read/write to
access them will put return/write data in the wrong half of the quadword, not simply return or write half of
the data.


2. Decouple the SCB CPU#_P[01] events from the CPU performance counter domain (U/S/K), perhaps with
new domain bits.


3. SCB performance counts from Ocla TrbC blocks depend on the TrbC configuration; this could be simplified,
bug1717.


4. R_ScbPerfEna should have a way to stop immediately, without corrupting, for interrupt handlers. Perhaps
add a _Pause bit that stops on current bucket and partial interval. We’ll also need to make the partial interval
programmable so context switches can reprogram it.


5. R_ScbPerf* registers should be writable without needing to stop sampling.


6. R_ScbInt should indicate what bucket(s) have caused the overflow, to save software from having to read the
entire count ram on each overflow, bug2164.


7. R_SysTapMsp transactions should be double buffered, as the Msp decision loop is quite slow.


8. R_ScbInt, like most of the other blocks in the chip, contains the interrupt state before masking. This requires
the interrupt handler to read (or cache) R_ScbIntMask before dispatching interrupts.
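

Because R_ScbInt holds the pre-mask interrupt state, a handler has to combine it with R_ScbIntMask itself
before dispatching. The Python sketch below shows that step under one loud assumption: that a 1 in the mask
means the interrupt is enabled. The text does not state the mask polarity, and the handler table and function
name are hypothetical.

```python
def dispatch(raw_int, int_mask, handlers):
    """Dispatch only interrupts that are both raised and enabled.

    raw_int  : value read from R_ScbInt (state before masking)
    int_mask : value read or cached from R_ScbIntMask
               (ASSUMPTION: 1 = enabled; invert it if the hardware uses 1 = masked)
    handlers : dict mapping bit number -> callable (hypothetical table)
    Returns the list of bit numbers that were serviced.
    """
    pending = raw_int & int_mask
    serviced = []
    bit = 0
    while pending:
        if pending & 1:
            handler = handlers.get(bit)
            if handler is not None:
                handler()
                serviced.append(bit)
        pending >>= 1
        bit += 1
    return serviced


calls = []
table = {0: lambda: calls.append(0), 1: lambda: calls.append(1), 3: lambda: calls.append(3)}
# Bits 0, 1 and 3 are raised, but only bits 1 and 2 are enabled.
print(dispatch(0b1011, 0b0110, table))
```

Without the AND with the mask, the handler in this example would also service bits 0 and 3, which software
had disabled.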
19.12 LBS
19.12.1 Product and Chip Pass Differences


1. ICE9A1 returns a different revision (ICE9A1 vs ICE9A0) when reading the IDECODE register.


2. ICE9B fixes Sms Reset synchronized to the wrong clock, bug2055. This required the smsclock to be turned off
whenever we wiggle reset, then turned on again a bit later.


3. ICE9B eliminates R_SysTapDint, replaced with the SCB-space R_ScbDInt, bug2223.
4. ICE9B supports transmit interrupts for R_SysTapAtnMsp, and separates RW1C bits, bug2222.


5. NEED IMPL: TWC9A changes the default value for R_SysTapPILD*clkDifv to support a processor default
clock frequency of *FIX* MHz, bug3384.


6. TWC9A fixes access to any SCB bus slave hanging while the DDR controller is in reset, bug2928.


7. TWC9A adds an R_SysTapReset_Lac and _Pmi to separate the R_SysTapReset_Scb bit from also controlling
the BBS/PMI reset, bug2929. Earlier products needed caution when maintaining FSW/FL traffic during
partial reboots.


8. NEED IMPL: TWC9A adds R_SysTapReset_Proc6, and _ProcSms6 to support the additional cores.
9. TWC9A uses R_SysTapInstrTwe instead of R_SysTapInstReg to support the additional cores.


10. TWC9A adds R_SysTapScb64 to access doubleword SCB registers. Code should use this new register or 64
bit SCB registers will not be visible.


11. NEED IMPL: TWC9A adds R_SysMemInit register and associated functions for on-chip memory initialization.
In previous products BIST was used to initialize on-chip memories.


12. NEED IMPL+SPEC: TWC9A will merge the SysChain and E-Silicon chain on-chip instead of off-chip.


13. NEED IMPL+SPEC: TWC9A will replace or make the E-Silicon chain IEEE compliant (on the correct edges).


19.12.2 Known Bugs and Possible Enhancements 


1. [Larry] Add a new LBS+SCB region. The msp could set the start address in 32 or 64 bit steps, and then scan
in, say 128 bytes with a continuous shift on the scan. Then, while the ice9 digests that block, the msp scans
in 128 bytes into the alternate half of the block. This is essentially a block of shared memory accessed on the
ice9 side by scb and on the msp side by efficient scan. The scan chain would shift in a direction compatible
with the qspi as well. This shared area would be used instead of fastdata (since it would be much faster) for
boot2 loading, and we would also use it for block transfers of attn data instead of doing that 26 bits at a time
via the current attn register.


19.13 UART 
19.13.1 Product and Chip Pass Differences 


1. FIX NEED IMPL: TWC9A removes the UART flow control signals. They were never used on the ICE9
modules.


19.14 OCLA 
19.14.1 Product and Chip Pass Differences 


1. ICE9B fixes GO->0 should shut OFF collection, bug2246. CollectTrace can be left ON by stopping an OCLA
program that had not yet seen its trigger. CollectTrace can only be controlled by a running OCLA program,
so you can’t shut it off by SCB writes. While CollectTrace is ON, you cannot dump any CTBs. Workarounds:
(a) A Diagnostics Dash script has been written that loads and runs a minimal OCLA program to shut off
CollectTrace. (b) The OCLA dump program has been written to detect CollectTrace=ON, and exit with
a meaningful error message. (c) OCLA Dash scripts and all example OCLA programs have been written with
a “graceful exit” option, where a specific register-write tells it to shut CollectTrace OFF and stop watching
for the trigger it didn’t get yet.


2. ICE9B adds new INCRBTH Opcode, bug2179. In ICE9A, although OCLA has 2 counters, you cannot count
2 events concurrently, because if both happen on the same clock there’s no way to increment both counters.


3. ICE9B enlarges counters from 12 to 16 bits, bug2244.


4. ICE9B fixes PMI CTB ExtMuxSel wired to TRBC, bug1959. The ExtMuxSel wires of OCLA PMI CTB were
wired to the SCB register that’s supposed to control OCLA PMI TRBC. To workaround, write desired PMI
CTB ExtMuxSel value to ExtMuxSel field in control register for PMI TRBC. Fortunately, PMI TRBC has
no ExtMux, so this field is otherwise unused. Simplest solution without determining whether you have ICE9A
or ICE9B is to write desired PMI CTB ExtMuxSel value to both ExtMuxSel fields.


5. ICE9B fixes CAC trigger PrbState obscured by WtPrb2L2, bug1995. OCLA CAC TRBC mux=2 signals
PrbState[2:0] had WtPrb2L2 OR-ed into PrbState[2]. To workaround, don’t use PrbState as a trigger, or
only trigger on PrbState groups of state that you can identify with bits [1:0].


6. ICE9B fixes CAC trigger W0Hit/W1Hit instead of W0Miss/W1Miss, bug2243. The fix applies to both the CAC
Trigger Block and Collector Block hookups in ICE9A: (a) Change W0Miss/W1Miss to something better, perhaps
W0Hit/W1Hit. Miss is including Idle and I/O. (b) Adjust flops so W0Hit/W1Hit are in the same clock with related
signals. To workaround, (a) qualify with not-Idle and not-IO. (b) Separately feed Hit and the other signals
to LAC in separate triggers, then align them with Dly regs in LAC.


7. MIGHTFIX: TWC9A might fix OCLA to SCB uses LAC triggers, bug1717. Passing OCLA events from
trigger blocks to SCB Counters ties up LAC trigger configuration, usually preventing simultaneous OCLA
use for other purposes. To workaround, accept that you are tying up OCLA with this. The cross connections
between OCLA and SCB counting may not be used that much. You might prefer to count SCB events in
SCB counters, and count OCLA events in OCLA counters.


8. MIGHTFIX: TWC9A might allow trigger delays for blocks located in other than the CCLK domain, bug1854.


9. MIGHTFIX: TWC9A might add capture mux settings for the CPU program counter and L2<->L1 signals.


10. NEED IMPL: TWC9A might add capture mux settings for the FSW links 1 and 2, bug2232.


11. MIGHTFIX: TWC9A might fix DMA CTB qualifier in wrong clock, bug2193. In DMA’s hookups to OCLA,
the ue_xxx_DbgValid_c2a signal is sent into the trigger block and CTB, when really it should be delayed by
two more cycles. In the CTB as a qualifier we pretty much cannot use it, because you want to use it in
combination with other signals like DbgThread_c4a and DbgPc_c4a. To workaround, only do un-qualified
collection in DMA CTB. In DMA trigger block, send it and other signals separately on the 2 triggers to LAC,
where the Dly regs can align them.


12. MIGHTFIX: TWC9A might add a WtAddr sticky overflow bit, bug2207.


13. MIGHTFIX: TWC9A


19.14.2 Known Bugs 


1. Overflow bits still set as OCLA starts, bug1825. OCLA’s automatic clearing of counter overflow bits when
you start LAC program is delayed a clock or two. Early instructions in LAC program can falsely trigger on
overflow depending on the previous use of OCLA. To workaround, never branch on Counter Overflows in first
2 instructions of any LAC program.


2. CTB WtAddrClr triggered by any address in CTB, bug2026. Writing 0x10 to any SCB register address in
a particular Ocla CTB can trigger WtAddrClr (clear write address reg). This even includes unused addresses
within the SCB address space of a CTB. To workaround, never write any of the read-only registers.
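

The bug1825 rule (no branch on Counter Overflow in the first 2 instructions) is easy to check mechanically
before loading a LAC program. The Python sketch below assumes a made-up in-memory representation in which
each instruction is a dict carrying a branch_on_overflow flag; the real LAC instruction encoding is not
described in this section, and the BRANCH mnemonic is hypothetical.

```python
def early_overflow_branches(program, guard=2):
    """Return indices of instructions in the first `guard` slots that
    branch on a counter overflow (the bug1825 hazard).

    `program` is a list of dicts in a hypothetical format; only the
    'branch_on_overflow' key is inspected.
    """
    return [i for i, instr in enumerate(program[:guard])
            if instr.get("branch_on_overflow", False)]


program = [
    {"op": "SETCOLL"},
    {"op": "BRANCH", "branch_on_overflow": True},   # too early: slot 1
    {"op": "BRANCH", "branch_on_overflow": True},   # fine: slot 2
]
print(early_overflow_branches(program))
```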
19.14.3 Possible Enhancements


1. Make both LAC counters 32-bit (currently 16-bits plus sticky overflow bit). There’s only one instance of the 
LAC, so this is very affordable. We’ve wanted bigger counters when writing LAC programs, and unanticipated 
but valuable use of OCLA as a highly-configurable counter would benefit from full 32-bit counters. 


2. Separate “GO” Register. When you write OCLA management software for one of Ice9’s embedded processors, 
or for the external SSP, you tend to write one function that configures OCLA ahead of time, and another 
function to tell OCLA to “GO” at roughly the right moment. Currently the GO bit shares register R_LacCtl
with some configuration fields that need to be written correctly for what you want OCLA to do. This 
contributes to messy software design in that you must have handy the values to write to those fields when you 
write a 1 to GO to start the LAC program. It would be nice if all OCLA configuration could be encapsulated 
in, and completed by an OCLA configuration function. 


3. If SCB reg addresses are cheap, consider breaking R_LacCtl into 3 or 4 registers by type of access, making 
software easier to write. 


4. Collect ON/OFF by Register Write. Provide a super-simple alternative to writing a LAC program, for when 
exact timing of collection is not critical. Provide one or two registers that allow you, by SCB register write 
alone, to turn on and off CollectTrace to the CTBs. This allows someone with minimal knowledge of OCLA
to quickly collect some trace information and read it out, just by doing easy-to-understand SCB writes and 
reads. Some semi-steady-state activities can be viewed at an arbitrary time, or you could try more than 
once till you see it. Or, for more accuracy, you could have Ice9 embedded processor code trigger collection at 
roughly the right time, and rely on the 1024-entry size of the CTBs to give you a pretty big window to land 
in. These reg writes would use the same logic as the SETCOLL and CLRCOLL opcodes from LAC.


5. Trigger by Register Write. There are ways to do this now, but they’re a little obscure. I’m suggesting a 
very-simple up-front way to trigger your LAC program by writing an SCB register in LAC whose sole purpose
is to do this. Aggregate Mask and Match bits 0 and 1 are available, so why not have them driven directly
from such a register?


6. Clarify When CTB Has New Contents. Currently it’s a little hassle to do sanity checks that your CTB really 
got new contents from running your LAC program. Especially when you are wondering if you configured 
everything correctly. You can “trust that a good-status completed LAC program means you have new CTB 
contents”. You can alternate the CTB’s external mux between what you want to collect and something else, 
then read-out the CTB and see that contents changed. 


7. CTB Zeroing. An SCB-register “ClearCtb” action-bit in each CTB, that would zero-out the CTB (taking 
1024 clocks). This bit could be readable and self-clears after the 1024 clocks have passed, so it’s safe to start 
a new collection. 


8. StopOnFull Final Address. Currently, in StopOnFull mode, when the CTB gets full and stops collecting, the 
final address is 0x000, which is the same address it would have if it never started! Either change this to stop 
at 0x3FF, or have a sticky overflow bit which clears when you write WtAddrClr in R_CtbxColCtl.


9. StoppedOnFull Status Bit. If in StopOnFull mode, have a read-only bit StoppedOnFull in R_CtbxColCtl.
This signal already exists in the CTB Verilog code. 


10. Fix the “Collecting” Status Bit. Bit “Collecting” of R_CtbxColCtl is directly flopped off of lac_ctb_CollectTrace_c0a,
which means it doesn’t take into consideration a CTB in StopOnFull mode that has become full. Reading of
the CTB works in that case. Change Collecting to be false if StopOnFull and full. A signal with this info
is available in the CTB verilog code. You might also consider having “Collecting” read back as 0 when
EnableCollect==0. To be able to see the level of signal lac_ctb_CollectTrace_c0a clearly in one central place, add
read-only bit “CollectTrace” to R_LacCtl (or if R_LacCtl gets broken-up into several registers as suggested,
put this bit in whatever register contains the other read-only fields).


11. Have 0xFFFFFFFF Indicate Bad Read. If you try to read the contents of your CTB when you cannot,
you currently get all-zeros. All-zeros can mean you never collected anything, and also for some units it’s
a likely read-result if you collected during an idle time. A tiny change in the verilog could make it return
0xFFFFFFFF’s for reads when you can’t read the CTB. This would be clearly different than a failure to
trigger collection, and is an almost-impossible long series of values for any CTB to collect. 
12. Stopping LAC Stops Collection. Have a transition of the GO bit 1 -> 0 cause the CLRCOLL action. This
eliminates the hazard of someone stopping the LAC program manually by clearing the GO bit, but then being
unable to read any CTB contents because CollectTrace is still ON. Have this be by 1 -> 0 transition, not by
GO==0, so we can have the previously-mentioned registers that turn on and off collection. The way OCLA
is now it can be very irritating if you happened to shut off LAC by writing 0 to the GO bit when collection
was ON. There’s no straightforward way to shut off collection of all enabled CTBs by register-write; you can
only shut them off by opcode CLRCOLL in a running LAC program. This is no problem when the next
LAC program you wish to run is of the CTB StopOnFull=0 unqualified style, but if you are doing qualified
collection with StopOnFull=1 and you want to start at CTB address=0 it can be a problem. You might think
you could just begin every LAC program with a CLRCOLL and your problems would be solved, but there’s
no way inside a LAC program to clear a CTB’s WtAddr.


13. Move Delay Registers into the Trigger Blocks. Having the Delay Registers centralized in LAC means they’re
all flopped in cclk domain. FSW triggers and trigger blocks are in sclk domain. To be able to line-up FSW
signals into a complex trigger is hard, although this was partly solved by providing some FSW trigger signals
to its trigger blocks more than once, with different sclk delays. The best solution to this is to have the delay
registers in the Trigger Blocks, not centralized in the LAC.


14. More External-Mux Values, or Extra Mux in FSW. Boost the number of bits to control external muxes from
3 to 4 or 5. Do this for all types of trigger and collector blocks. Almost no extra logic is created by this except
in those blocks where the extra external-mux-value options are used. The motivation for this is with regard
to the Link side of FSW. Currently OCLA in FSW only looks at FLR-0 and FLT-0 signals, due to mux-value
limitations. For better board and system debug, to use OCLA freely to see damaged traffic arriving on any one
particular link, we really want all 6 links covered by OCLA. (b) Another way to get all 6 Link interfaces in
FSW into OCLA, without changing OCLA Trigger or Collector blocks, is to just put a new register into FSW.
This register in FSW’s register address space would take values of 0, 1, or 2, and would drive a first level
of muxing, selecting which link-number provides FLR and FLT signals to the current OCLA-register-driven
external muxes.


15. More Collection Qualifiers. CTBs currently allow up-to 2 Qualifier signals. In some uses of CTBs there were
more signals that would be handy to have available as qualifiers. The external mux selecting data for a CTB
often selects between a good number of unrelated interfaces. In a number of cases you just accept that you
have to do un-qualified collection, because the 2 qualifiers provided are not relevant to the interface or signals
you are looking at.


16. More CTB Qualifier Inputs. Perhaps 4.


17. Use External Mux on Qualifiers. When instantiating CTBs, follow the example of how FSW Vector Trigger
Blocks are instantiated, where the external mux selectors vary both the data and the qualifier to be used.


18. Eliminate Qualifiers in Codeword Trigger Blocks. The way Codeword Trigger Blocks work, all the trigger
inputs are effectively qualifiers on each other. There’s no reason to handle some inputs differently and call
them “qualifiers”.


19. Widen Vector Trigger Blocks to 64-Bits. FSW is really the only place where Vector Trigger Blocks are used,
because the way they’re used in DMA is more naturally served by Codeword Trigger Blocks. In FSW the
natural width of the busses looked-at is 64 bits. It would be a usage simplification if the Trigger Block just
looked at the 64 bits.
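

The ambiguity behind the StopOnFull final-address item (item 8 above) can be shown with a small model of a
1024-entry CTB write address. This Python sketch assumes the write address simply wraps modulo 1024 and adds
the proposed sticky “full” bit next to it; it is an illustration of the suggestion, not a description of the
existing Verilog, and the function name is hypothetical.

```python
CTB_ENTRIES = 1024   # CTB depth, per the 1024-entry size mentioned in the text


def run_ctb_stop_on_full(samples):
    """Model a CTB in StopOnFull mode collecting `samples` entries.

    Returns (write_address, full), where `full` is the proposed sticky
    overflow bit. The modulo-1024 wrap is an assumption of this model.
    """
    addr, full = 0, False
    for _ in range(samples):
        if full:
            break                      # StopOnFull: no more writes once full
        addr = (addr + 1) % CTB_ENTRIES
        full = addr == 0               # wrapped back to 0x000, so the buffer is full
    return addr, full


print(run_ctb_stop_on_full(0))      # never started
print(run_ctb_stop_on_full(1024))   # filled exactly once
```

In both calls the write address is 0x000, so the address alone cannot tell an empty CTB from a full one; only
the extra sticky bit distinguishes them.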